Analytics/Server Admin Log/Archive/2019
2019-12-31
- 16:25 elukey: re-run webrequest_load 2019/12/30-15 due to (hopefully) temp hive/kerb issues
2019-12-30
- 13:48 joal: rerun webrequest-load-wf-text-2019-12-30-8 with updated error-thresholds
2019-12-23
- 15:01 fdans: deploying refinery
- 08:55 joal: Manually killing application_1576512674871_6292 as it's failing
- 07:54 elukey: re-run 2019-12-22-17 again with customized max heap settings (/user/elukey/refinery dir on hdfs)
- 07:53 elukey: re-run 2019-12-22-14 again with customized max heap settings (/user/elukey/refinery dir on hdfs)
- 00:44 elukey: re-run 2019-12-22-17 again with customized max heap settings (/user/elukey/refinery dir on hdfs)
2019-12-22
- 18:31 elukey: re-run 2019-12-22-14 again with customized max heap settings (/user/elukey/refinery dir on hdfs)
- 17:15 elukey: re-run webrequest_load 2019-12-22-14
2019-12-21
- 15:54 elukey: re-run webrequest_load 2019-12-21T14 (failed due to mappers ooms)
2019-12-19
- 18:42 mforns: deployed refinery (corresponding to source v0.0.109)
- 17:53 mforns: deployed refinery-source v0.0.109
2019-12-17
- 08:22 elukey: re-launch netflow realtime supervisor in Druid Analytics
- 08:10 joal: Kill-restart cassandra-daily-coord-local_group_default_T_mediarequest_per_file to fix the refinery-hive-jar path issue
2019-12-16
- 20:21 joal: Rerun webrequest-load-wf-text-2019-12-16-13 without dataloss-error threshold after having checked for dataloss (real dataloss, 10^-4 percent)
- 15:41 mforns: finished deploying analytics refinery for kerberos migration
- 15:20 mforns: deploying analytics refinery for kerberos migration
- 12:56 joal: Kill all oozie jobs after having dumped their statuses
- 12:26 joal: Reference for killed backfilling mediarequest-per-file job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0003296-191212123816836-oozie-oozi-C/
- 12:26 joal: Reference for killed backfillin jo
- 12:23 joal: Kill backfilling job for mediarequest-per-file with 2017-07-0[2345] days not done
- 12:22 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2019-12-15
- 12:17 elukey: kill netflow realtime druid supervisor as prep step for kerberos
- 11:14 joal: Clean spark-shell drivers on cluster before kerberos
- 10:46 elukey: stop airflow-* on an-airflow1001
- 10:41 elukey: stop jupyterhub on notebook100[3,4] as prep step for kerberos
- 10:38 elukey: kill Nuria's spark shell application masters in Yarn
- 10:17 elukey: stop hadoop-related timers on stat1007
- 10:04 joal: Killing user-app eating all cluster (application_1573208467349_190044)
- 09:05 joal: Rerun webrequest-load-wf-text-2019-12-14-18 with updated error-checking parameters (all false positive)
- 08:49 elukey: re-run webrequest-load 2019-12-14-13 and 2019-12-15-12 with higher mapreduce limits (modified version of refinery on hdfs /user/elukey with https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/557794/)
- 07:22 elukey: stop camus timers as prep step for maintenance (if we'll do it)
2019-12-13
- 07:42 elukey: execute reset-failed for monitor_refine_mediawiki_job_events on an-coord1001
2019-12-12
- 18:46 elukey: rsync timers deployed on labstore100[6,7]
- 15:23 elukey: execute systemctl reset-failed monitor_refine_mediawiki_job_events after Andrew's comment on alerts@
- 12:59 elukey: roll restart hadoop workers to pick up the new settings (removed prefer ipv4 false after T240255)
- 12:40 elukey: enable timers on an-coord1001 after maintenance
- 12:39 elukey: restart hive and oozie on an-coord1001 to pick up ipv6 settings
- 11:14 elukey: stop timers on an-coord1001 as prep step for hive/oozie restart
2019-12-11
- 07:07 elukey: kill/re-run pageview 2019-12-10-17, stuck in whitelist check for hours (https://hue.wikimedia.org/jobbrowser/jobs/job_1573208467349_171800 for more info)
2019-12-10
- 14:34 elukey: shutdown of stat1004 to check if it can hold a GPU
- 14:08 jbond42: rolling restart of varnishkafka-webrequest and varnishkafka-eventlogging
2019-12-05
- 09:35 elukey: enable timers on an-coord1001 after maintenance
- 09:34 elukey: stop oozie/hive-*; restart mariadb; restart oozie/hive-* on an-coord1001 to pick up explicit_defaults_for_timestamp - T236180
- 09:06 elukey: temporarily stop timers on an-coord1001 to ease the restart of mariadb on an-coord1001
2019-12-04
- 20:57 milimetric: finished refinery-deploy-to-hdfs from stat1004 but something's broken on stat1007 in the /srv/deployment/analytics/refinery repo
- 20:08 milimetric: deployed refinery source
- 11:36 elukey: restart mariadb on analytics1030 (hadoop test coordinator) to test explicit_defaults_for_timestamp - T236180
2019-12-03
- 09:48 joal: Kill restart mediawiki-history-load-coord after sqoop re-import of missing tables
2019-12-02
- 20:27 joal: restart cassandra bundle
- 20:17 joal: Deploying refinery to hdfs - Last for today!
- 20:00 joal: Deploy refinery using scap to fix today deploy (last)
- 19:20 joal: Manually kill cassandra-coord-mediarequest-per-referer-hourly from bundle as it shouldn't exist
- 19:07 joal: restart cassandra bundle after redeployed patch
- 18:40 joal: Deploy refinery onto hdfs
- 18:39 joal: Deploy refinery using scap for fixes
- 16:43 joal: Restarting cassandra bundle after deploy
- 11:40 joal: Restart mediawiki-geoeditors-monthly-coord
- 11:39 joal: Drop wmf.geoeditors_daily table and create wmf/editors_daily, moving underlying data and recreating partitions
- 11:35 joal: Kill mediawiki-geoeditors-monthly-coord before updating the job
- 10:30 joal: Manually sqoop tables not yet done because of late deploy (content_models, content, slots, slot_roles, wbt_entity_usage)
- 10:21 joal: Create new tables for newly sqooped data in hive wmf_raw database
- 09:43 joal: Deploying refinery onto HDFS
- 09:22 joal: Deploy refinery using scap
2019-11-27
- 09:16 elukey: apply systemd user limits to stat1005
- 07:10 elukey: apply systemd user limits to stat1006,stat1007 and notebook100*
2019-11-26
- 17:19 elukey: add systemd user limits to stat1004
2019-11-25
- 13:27 elukey: set global read_only=1 on db1108's log database
2019-11-21
- 20:07 mforns: deploying refinery to add pageview whitelist changes and stop alerts
- 15:50 mforns: deployed refinery (with v0.0.107)
- 15:10 mforns: deployed refinery-source v0.0.107
- 06:59 elukey: restart hdfs-cleaner on an-coord1001
2019-11-19
- 19:00 elukey: regenerate TLS cert for yarn.wikimedia.org (containing SANs for all analytics UIs) to add datasets.w.o SAN (site was failing due to ATS not being able to contact thorium)
- 13:54 joal: Deleting 600 more log-folders from analytics user (cassandra backfilling logs) -- T238648
- 13:46 joal: Deleting old parquet wikitext data (new data is stored in Avro) -- T238648
- 13:46 joal: Deleting 100 heavier log-folders from analytics user (cassandra backfilling logs) -- T238648
- 07:51 elukey: restart hdfs-cleaner on an-coord1001
2019-11-18
- 20:03 joal: Rerun failed mediawiki_wikitext_history oozie job (2019-10)
2019-11-16
- 09:44 elukey: systemctl restart hadoop-* on analytics1077 after oom killer
2019-11-15
- 17:05 elukey: restart hdfs-cleaner (failed due to tmp hive files not present when deleting)
2019-11-14
- 16:09 elukey: roll restart presto-server on an-presto* to pick up new openjdk
- 08:47 joal: starting hdfs-cleaner manually after failure earlier this night
- 08:37 fdans: initiating backfilling of daily top mediarequests from the mediacounts database - May 2018 to May 2019
2019-11-12
- 16:52 elukey: forced a purge in Varnish for the stats.wikimedia.org front page to pick up the new deprecation banner
- 15:42 fdans: manually overwriting index.html in Wikistats 1 to apply patch https://gerrit.wikimedia.org/r/#/c/analytics/wikistats/+/550338/
2019-11-08
- 12:34 elukey: roll restart cassandra on aqs to pick up new openjdk upgrades
- 12:22 elukey: restart oozie and hive daemons on an-coord1001
- 09:05 elukey: roll restart druid daemons on druid public to pick up the new jvm
- 08:34 elukey: roll restart druid daemons on druid analytics to pick up the new jvm
- 08:34 elukey: restart kafka on kafka-jumbo1001 to test openjdk
2019-11-07
- 17:33 elukey: restart zookeeper on druid nodes for jvm upgrades
- 17:33 elukey: restart all jvms on hadoop test workers
- 15:41 elukey: roll restart all jvms on Hadoop Analytics Workers to pick up the new jvm
- 12:18 joal: Deleting stat1007:/srv/reportupdater/output/metrics/reference-previews/baseline.tsv as asked by awight
2019-11-06
- 22:36 milimetric: successfully restarted webrequest bundle, webrequest druid daily, and webrequest druid hourly
- 21:21 milimetric: restarting webrequest load bundle and druid loading jobs
- 21:21 milimetric: refinery deployed, hdfs cleaner and tls ready to be restarted
- 20:16 joal: Kill-rerun pageview-hourly-wf-2019-11-6-13 for being stuck in whitelist-check
- 18:13 joal: restart refinery-import-page-current-dumps.service to test after yesterday's failure
2019-11-05
- 21:06 joal: restarting oozie jobs after spark 2.4.4 upgrade
- 21:04 ottomata: re-enabling refine jobs after spark 2.4.4 upgrade
- 20:57 joal: Starting denormalize-check one month in advance to enforce a running job with new spark
- 20:37 ottomata: roll restarting hadoop-yarn-nodemanagers to pick up spark 2.4.4 shuffle lib
- 20:21 ottomata: install spark 2.4.4-bin-hadoop2.6-1 cluster wide using debdeploy - T222253
- 20:18 joal: Deploying refinery onto HDFS
- 20:12 ottomata: stopped refine jobs for Spark 2.4 upgrade - T222253
- 20:09 joal: Deploying refinery using scap with missing patch
- 20:00 joal: Deploying refinery using scap
- 18:49 joal: Make Jenkins release refinery-source v0.0.105 to archiva
- 17:12 ottomata: 2019-11-05T17:11:50.239 INFO HDFSCleaner Deleted 872360 files and directories in tmp
- 17:01 ottomata: first run of HDFSCleaner on /tmp, should delete files older than 31 days
- 11:00 fdans: testing load of top metric from mediarequests with corrected quotemarks escaping
2019-11-04
- 23:28 milimetric: deployed refinery
- 14:58 joal: restarting AQS using scap after snapshot bump (2019-10)
2019-10-31
- 19:45 fdans: (actually no, no need)
- 19:43 fdans: (changing jar version first)
- 19:43 fdans: restarting mediawiki-history-wikitext
- 19:42 fdans: refinery deployment complete
- 19:17 fdans: updating jar symlinks to 0.0.104
- 17:59 fdans: deploying refinery
- 17:49 fdans: deploying refinery-source 0.0.104
- 16:36 elukey: restart oozie and hive-server2 on an-coord1001 to pick up new TLS mapreduce settings
- 15:31 joal: Rerun webrequest jobs for hour 2019-10-31T14:00 after failure
- 14:53 elukey: enabled encrypted shuffle option in all Hadoop Analytics Yarn Node Managers
- 10:17 elukey: deploy TLS certificates for MapReduce Shufflers on Hadoop worker nodes (no-op change, no yarn-site config)
2019-10-30
- 15:00 ottomata: disabling eventlogging-consumer mysql on eventlog1002
- 08:31 joal: Rerun failed cassandra-daily-coord-local_group_default_T_mediarequest_per_file days: 2019-10-26, 2019-10-23 and 2019-10-22
- 06:30 elukey: re-run cassandra-coord-pageview-per-article-daily 29/10/2019
2019-10-29
- 08:51 fdans: starting backfilling for per file mediarequests for 7 days from Sep 15 2015
- 07:09 elukey: roll restart java daemons on analytics1042, druid1003 and aqs1004 to pick up new openjdk upgrades
2019-10-28
- 10:10 fdans: mediarequest per file backfilling suspended
- 09:14 elukey: manual re-run of cassandra-coord-pageview-per-article-daily - 26/10/2019 - as attempt to see if the error is reproducible or not (timeout while inserting into cassandra)
2019-10-24
- 13:54 fdans: running top mediarequest backfill from 2015-01-02 to 2019-05-01
2019-10-23
- 18:59 milimetric: refinery deployment re-done to fix my mistake
- 18:37 mforns: refinery deployment done!
- 18:31 mforns: deploying refinery with refinery-deploy-to-hdfs up to 1110d59c3983bcff4986bce1baf885f05ee06ba5
- 18:21 mforns: deploying refinery with scap up to 1110d59c3983bcff4986bce1baf885f05ee06ba5
2019-10-22
- 15:47 fdans: start backfilling of mediarequests per file from 2015-01-02 to 2019-05-17 after ok vetting of 2015-01-01
2019-10-18
- 14:45 fdans: backfilling 2015-1-1 for mediarequests per file, proceeding with all days until 2019-05-17 if successful
2019-10-17
- 18:01 elukey: update librdkafka on eventlog1002 and restart eventlogging
- 10:26 elukey: rollback eventlogging back to Python 2, some errors (unseen in tests) logged by the processors
- 10:18 elukey: move eventlogging to python 3
2019-10-16
- 20:27 ottomata: upgrading to spark 2.4.4 in analytics test cluster
- 20:20 joal: Kill-restart mediawiki-history-dumps-coord to pick up changes
- 20:16 joal: Deployed refinery onto HDFS
- 20:08 joal: Deployed refinery using scap
- 19:45 joal: Refinery-source v0.0.103 released to refinery
- 19:29 joal: Ask jenkins to release refinery-source v0.0.103 to archiva
- 19:19 joal: AQS deployed with mediarequest-top endpoint
- 18:45 joal: Manually create mediarequest-top cassandra keyspace and tables, and add fake test data into it
2019-10-15
- 13:15 elukey: re-enable timers on an-coord1001
- 12:57 fdans: resumed backfilling of mediarequests per referer daily
- 12:46 elukey: moved hadoop cluster to new zookeeper cluster
- 11:25 elukey: stop all systemd timers on an-coord1001 as prep step for hadoop maintenance
- 10:42 fdans: backfilling January 1st 2015 for mediarequests per referer daily, proceeding with all days until May 2019 if successful
2019-10-14
- 18:13 joal: Manually add ban.wikipedia.org to pageview whitelist (T234768)
- 14:28 elukey: matomo upgraded to 3.11 on matomo1001
2019-10-11
- 12:51 elukey: deployed eventlogging python3 version in deployment-prep
- 07:09 elukey: drop test_wmf_netflow from druid analytics and restart turnilo
- 06:24 elukey: remove /tmp/hive-staging_hive_(2017|2018)* data from HDFS instead of /tmp/* to avoid causing hive failures (it needs to write temporary data for the current running jobs)
- 06:04 elukey: delete content of /tmp/* on HDFS
2019-10-10
- 09:13 joal: rerun failed pageview hour after manual job killing (pageview-hourly-wf-2019-10-9-19)
- 09:13 joal: Kill stuck oozie launcher in yarn (application_1569878150519_43184)
2019-10-09
- 20:52 milimetric: deploy of refinery and refinery-source 0.0.102 finally seems to have finished
- 19:55 milimetric: refinery ... probably? deployed with errors like "No such file or directory (2)\nrsync error"
- 17:11 elukey: restart druid-broker on druid100[5-6] - not serving data correctly
2019-10-08
- 09:22 elukey: delete druid old test datasource from the analytics cluster - test_kafka_event_centralnoticeimpression
2019-10-07
- 17:46 ottomata: powercycling stat1007
- 06:08 elukey: upgrade python-kafka on eventlog1002 to 1.4.7-1 (manually via dpkg -i)
2019-10-05
- 18:18 elukey: kill/restart mediawiki-history-reduced oozie coord to pick up the new druid_loader.py version on HDFS
- 06:49 elukey: force umount/remount of /mnt/hdfs on an-coord1001 - processes stuck in D state, fuser proc consuming a ton of memory
2019-10-04
- 16:27 ottomata: manually rsyncing mediawiki_history 2019-08 snapshot to labstore1006
2019-10-03
- 14:17 elukey: stop the Hadoop test cluster to migrate it to the new kerberos cluster
- 13:26 elukey: re-run refinery-download-project-namespace-map (modified with recent fixes for encoding and python3)
- 09:48 elukey: ran apt-get autoremove -y on all Hadoop workers to remove old Python 2 deps
- 08:43 elukey: apply 5% threshold to the HDFS balancer - T231828
- 07:48 elukey: restart druid-broker on druid1003 (used by superset)
- 07:47 elukey: restart superset to test if a stale status might cause data not to be shown
2019-10-02
- 21:21 nuria: restarting superset
- 16:18 elukey: kill duplicate of oozie pageview-druid-hourly coord and start the wrongly killed oozie pageview-hourly-coord (causing jobs to wait for data)
- 13:12 elukey: remove python-request from all the hadoop workers (shouldn't be needed anymore)
- 13:08 elukey: kill/start oozie webrequest druid daily/hourly coords to pick up new druid_loader.py version
- 13:04 elukey: kill/start oozie virtualpageview druid daily/monthly coords to pick up new druid_loader.py version
- 12:54 elukey: kill/start oozie unique devices per family druid daily/daily_agg_mon/monthly coords to pick up new druid_loader.py version
- 10:24 elukey: restart unique dev per domain druid daily_agg_monthly/daily/monthly coords to pick up new hdfs version of druid_loader.py
- 10:15 elukey: re-run unique devices druid daily 28/09/2019 - failed but possibly no alert was fired to analytics-alerts@
- 09:48 elukey: restart pageview druid hourly/daily/monthly coords to pick up new hdfs version of druid_loader.py
- 09:45 elukey: restart mw geoeditors druid coord to pick up new hdfs version of druid_loader.py
- 09:41 elukey: restart edit druid hourly coord to pick up new hdfs version of druid_loader.py
- 09:38 elukey: restart banner activity druid daily/monthly coords to pick up new hdfs version of druid_loader.py
- 08:31 elukey: kill/restart mw check denormalize with hive2_jdbc parameter
2019-09-30
- 21:05 ottomata: rolling restart of hdfs namenode and hdfs resourcemanager to take presto proxy user settings
- 05:26 elukey: re-run manually pageview-druid-hourly 29/09T22:00
2019-09-27
- 06:44 elukey: clean up files older than 30d in /var/log/{oozie,hive} on an-coord1001
2019-09-26
- 18:42 mforns: finished deploying refinery using scap (together with refinery-source 0.0.101)
- 18:27 mforns: deploying refinery using scap (together with refinery-source 0.0.101)
- 17:33 elukey: run apt-get autoremove on stat* and notebook* to clean up old python2 deps
- 15:01 mforns: deploying analytics/aqs using scap
- 13:04 elukey: removing python2 packages from the analytics hosts (not from eventlog1002)
- 11:13 mforns: deployed analytics-refinery-source v0.0.101 using Jenkins
- 05:47 elukey: upload the new version of the pageview whitelist - https://gerrit.wikimedia.org/r/539225
2019-09-25
- 13:37 elukey: move the Hadoop test cluster to the Analytics Zookeeper cluster
- 08:37 elukey: add netflow realtime ingestion alert for Druid
- 06:02 elukey: set python3 for all report updater jobs on stat1006/7
2019-09-24
- 14:46 ottomata: temporarily disabled camus-mediawiki_analytics_events systemd timer on an-coord1001 - T233718
- 13:18 joal: Manually repairing wmf.mediawiki_wikitext_history
- 06:07 elukey: update Druid Kafka supervisor for netflow to index new dimensions
2019-09-23
- 20:56 ottomata: created new camus job for high volume mediawiki analytics events: mediawiki_analytics_events
- 16:46 elukey: deploy refinery again (no hdfs, no source) to deploy the latest python fixes
- 09:25 elukey: temporarily disable *drop* timers on an-coord1001 to verify refinery python change with the team
- 08:24 elukey: deploy refinery to apply all the python2 -> python3 fixes
- 07:44 elukey: restart manually refine_mediawiki_events on an-coord1001 with --since 48 to force the refinement after camus backfilled the missing data
- 07:41 elukey: manually applied https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538235/ on an-coord1001
- 06:21 elukey_: restart camus mediawiki_events on an-coord1001 with increased mapreduce heap size
2019-09-21
- 09:00 fdans: resumed per file mediarequests backfilling coordinator
2019-09-20
- 17:04 elukey: restart yarn/hdfs daemons on analytics1045
- 17:01 elukey: remove /var/lib/hadoop/j from analytics1045 due to a broken disk
2019-09-19
- 13:31 joal: Kill-restart webrequest-load bundle to fix queue issue
- 10:37 elukey: manually rollback /srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions to "#!/usr/bin/env python" on stat1007
- 09:16 fdans: starting load to cassandra of mediarequests per file daily
2019-09-18
- 19:23 joal: Deploy AQS using scap - Try 3
- 18:59 joal: Deploy AQS using scap - Try 2
- 18:53 joal: Deploy AQS using scap
- 18:16 joal: Start mediawiki-history-dumps oozie job starting with August 2019
- 18:10 joal: Kill-restart webrequest-load oozie job to pick-up new ua-parser
- 18:09 joal: Restart eventlogging with new ua-parser (ottomata did)
- 16:46 elukey: manually restarted the refinery-drop-older-than jobs
- 16:45 elukey: manually set "#!/usr/bin/env python" for refinery-drop-older-than on an-coord1001 to restore functionality (minor bug encountered)
- 13:41 joal: Deploy refinery to hdfs
- 13:35 joal: Deploying refinery using scap
- 12:54 elukey: re-run webrequest-load upload/text for hour 11 due to transient hive server socket failures
- 12:39 joal: Release refinery-source v0.0.100 to archiva
2019-09-17
- 08:19 elukey: manually decommed analytics1032 for hdfs/yarn on the Hadoop testing cluster - T233080
- 07:50 joal: Manually released com.github.ua-parser/uap-java 1.4.4-core0.6.9~1-wmf to archiva
2019-09-16
- 12:41 elukey: rebooting the hadoop test cluster with the new spicerack cookbook as test
- 10:04 elukey: disable puppet on an-coord1001 and manually forcing python3 for camus - T204735
- 07:25 joal: Delete matomo error with URL http://Wikipedia/screen/Explore
2019-09-13
- 16:57 joal: Reset ua-parser/uap-java wmf branch to up-to-date master using push force
2019-09-12
- 09:35 elukey: drop old database 'superset' from analytics-meta (an-coord1001) after a precautionary backup
2019-09-11
- 18:42 nuria: deployment of v0.0.99 to cluster succeeded, letting it bake for a bit
- 18:14 nuria: deployment of v0.0.99 of refinery that includes quite a bit of cleanup
- 08:33 elukey: stat1005 upgraded with ROCm 2.7.1
2019-09-10
- 21:34 ottomata: restarting archiva service on archiva1001
- 18:57 joal: Manually fixed dewiki wikitext for snapshot=2019-07 (snapshot is now full and complete despite oozie error)
2019-09-04
- 15:55 joal: Deploy refinery using scap (fix for SLAs)
- 08:46 joal: Fix mediacounts-archive SLA and kill-restart job
2019-09-03
- 19:21 fdans: finished restart of all hosts, 2019-08 snapshot deployed
- 19:18 fdans: restarting service on aqs1004
- 15:20 fdans: creating test Cassandra keyspace "local_group_default_T_request_per_file_TEST"
- 13:43 joal: Deploying refinery using scap for fixes
- 13:30 joal: Kill-restart mediawiki-history jobs (denormalize, check_denormalize, reduced, metrics, wikitext)
- 13:26 joal: Kill-restart geoeditors jobs (monthly, yearly and druid)
- 13:21 joal: Kill-restart edit jobs (hourly and druid)
- 13:06 joal: Kill-restart unique_devices per_project_family jobs (daily, monthly, druid daily, druid daily aggregated monthly, and druid monthly)
- 13:00 joal: Kill-restart unique_devices per_domain jobs (daily, monthly, druid daily, druid daily aggregated monthly, and druid monthly)
- 12:43 joal: Kill-restart mobile-apps jobs (app-session, uniques daily and monthly)
- 12:34 joal: Kill-restart virtualpageview druid jobs (daily and monthly)
- 12:29 joal: Kill-restart wikidata jobs (article_placeholder, coeditors, specialentitydata)
- 12:28 joal: Fix interlanguage for naming convention
- 12:24 joal: Kill-restart interlanguage job
- 12:20 joal: Kill restart browser-general and clickstream
- 12:17 joal: Manually create success-files for banner_activity monthly to start
- 12:11 joal: Kill-restart banner-activity jobs (daily and monthly)
- 12:08 joal: Restart mediarequest job after hotfix (renaming) and needed ops (table change and data move)
- 11:43 joal: Kill-restart virtualpageview_hourly
- 11:42 joal: Kill mediarequest-hourly (more ops to do before restarting)
- 11:39 joal: Kill-restart mediacount jobs (load and archive)
- 11:33 joal: Kill-restart pageview-druid jobs (hourly, daily, monthly)
- 11:29 joal: Kill restart projectview_geo job
- 11:28 joal: Kill restart projectview_hourly job
- 11:25 joal: Kill-restart webrequest-druid jobs (hourly and daily)
- 11:16 joal: Kill-restart pageview_hourly, aqs_hourly and apis jobs
- 11:10 joal: Fixing data-quality bundle and coord for restart
- 11:05 joal: Kill-restart data-quality bundle
- 11:01 joal: Kill-restart cassandra bundle (beginning of month)
- 10:56 joal: Hotfixing webrequest-load job to prevent redeploying
- 10:50 joal: Kill/restart webrequest bundle
- 08:21 joal: Kill-restart mediawiki-load and geoeditors-load jobs after corrective deploy
- 08:10 joal: Deploy refinery onto HDFS
2019-08-27
- 20:46 ottomata: rolling back to jupyterlab version 0.32.1, 1.0.x is not compatible with Stretch's version of nodejs - T230724
- 16:02 mforns: restarted turnilo to apply changes to config
2019-08-26
- 19:06 ottomata: update spark2 package to -4 version with support for python3.7 across cluster. T229347
- 11:30 joal: Remove tracking failure for http://Wikipedia/screen/Explore in matomo
2019-08-23
- 12:24 joal: Rerunning refine for eventlogging-analytics for 2019-08-23T03:00
- 09:38 elukey: restart hive-server2 on an-coord1001 to pick up new settings - T209536
- 08:06 joal: Launch mediawiki-history-dump test from marcel folder
2019-08-22
- 15:13 elukey: remove reading_depth druid load job from an-coord1001
- 14:20 joal: Start mediarequests oozie coordinator from 2019-08-14T12:00
- 13:41 joal: Deploying refinery onto hdfs
- 13:18 joal: Deploying refinery with scap
- 12:48 joal: Releasing refinery v0.0.98 on archiva from jenkins after correction
- 12:10 joal: Release refinery-source v0.0.98 to archiva (correction)
- 12:10 joal: Release refinery-source v0.0.98 to jenkins
- 09:34 elukey: clean up on the oozie db loop_* workflows (oozie stuck for some reason, most of the coords not processing anything for hours)
- 08:35 joal: Restart webrequest bundle
- 08:32 joal: Manually kill all leftover workflows from mediawiki-history-dumps
- 08:29 joal: Kill webrequest-load bundle
- 08:18 moritzm: restarted oozie on an-coord1001
- 07:32 joal: Rerun webrequest-load for text and upload, hours 21 and 22
- 07:28 joal: Suspend/resume stalled coordinators in hue
2019-08-21
- 14:27 elukey: swap turnilo backend in varnish from analytics-tool1002 to an-tool1007
2019-08-20
- 06:57 elukey: drop wmf_netflow from Analytics druid and restart the job with more dimensions
2019-08-14
- 17:58 fdans: backfilling mediarequests from 2019-5-16 to 2019-8-14
2019-08-13
- 08:47 elukey: kill/restart mediawiki geoeditors|history load, history wikitext to pick up changes to the repair workflow (hive2 actions)
- 08:40 elukey: kill/restart mediawiki geoeditors druid/monthly to pick up hive2 actions
- 08:40 elukey: kill/restart mediawiki history metrics/reduced to pick up hive2 actions
- 06:23 elukey: kill/restart oozie coord unique_devices per_project_family daily due to missing jdbc_url in coordinator.properties (for hive2 actions)
2019-08-12
- 22:00 mforns: restarted projectview geo coordinator in oozie
- 21:55 mforns: restarted all Unique Devices coordinators (except cassandra ones) in oozie
- 21:28 mforns: restarted all Virtualpageview coordinators in oozie
- 21:14 mforns: restarted Webrequest druid coordinators in oozie
- 21:06 mforns: restarted Webrequest bundle in oozie
- 19:32 mforns: Finished deployment of analytics-refinery up to 5418d3be5f65f7325324d0c15c51b3ca722dde1c
- 18:33 mforns: Starting deployment of analytics-refinery up to 5418d3be5f65f7325324d0c15c51b3ca722dde1c
2019-08-09
- 06:36 elukey: restart oozie coords to pick up new hive2 actions (edit hourly, pageview druid daily/hourly/monthly, mobile apps uniques daily/monthly)
2019-08-08
- 17:47 fdans: refinery deploy successful
- 17:33 fdans: scap deploy of refinery done
- 16:40 fdans: deploying refinery
- 16:38 fdans: updating jars
- 16:25 fdans: releasing refinery-source 0.0.97 to Maven
- 16:15 fdans: restarting oozie coordinator pageview-druid-monthly-coord
- 16:14 fdans: restarting oozie coordinator pageview-druid-daily-coord
- 16:13 fdans: restarting oozie coordinator pageview-druid-hourly-coord
- 16:09 fdans: restarting oozie coordinator mobile_apps-uniques-daily-coord
- 16:08 fdans: restarting oozie coordinator mobile_apps-uniques-monthly-coord
- 16:02 fdans: restarting edit_hourly
2019-08-02
- 14:28 mforns: restarting oozie bundle for cassandra and oozie coordinator for edit_hourly
- 14:27 mforns: finished deploying refinery
- 13:57 mforns: deploying refinery up to b50a93955952ed863d5ef7703a91ab59f5d979cf (rollback of cassandra and edit_hourly hive2 actions to unbreak production)
- 13:26 elukey: kill/start edit hourly oozie coordinator as attempt to fix a recurrent failure
- 08:52 elukey: manually created /tmp/hive/operation_logs on an-coord1001
2019-07-31
- 18:06 mforns: deployed Wikistats2 version 2.6.5
- 16:04 mforns: finished deployment of analytics-refinery up to eb2d9b005b26f6dddab2b59f1ba591f1758ec99f
- 15:37 mforns: starting deployment of analytics-refinery up to eb2d9b005b26f6dddab2b59f1ba591f1758ec99f
- 12:58 elukey: roll restart zookeeper on druid clusters with spicerack cookbook
- 08:06 elukey: increase heap size on HDFS Namenodes (an-master100[12]) to 16G
2019-07-30
- 08:14 mforns: restarted hive-server2
- 08:14 mforns: restarted hive-metastore
- 07:59 mforns: restarted oozie in an-coord1001.eqiad.wmnet
2019-07-29
- 15:34 elukey: roll restart kafka brokers on Jumbo with spicerack
- 13:01 elukey: roll restart druid jvms on druid100[4-6] via spicerack cookbook
- 08:56 elukey: roll restart druid jvms on druid100[1-3] via spicerack cookbook
- 06:31 elukey: restart the hadoop workers' jvms via spicerack cookbook
2019-07-26
- 07:42 elukey: restart mediacounts-load hourly coordinator after refinery deployment to hdfs
- 07:33 elukey: restart browser-general daily coordinator to pick up hive2 settings
- 07:31 elukey: restart banner_impressions daily coordinator to pick up hive2 settings
- 07:27 elukey: restart aqs coordinator to pick up hive2 settings
- 07:18 elukey: deploy last version of refinery to HDFS
- 06:28 elukey: restart aqs coordinator with hive2 actions
2019-07-25
- 20:47 nuria: restarting banner_activity/druid/daily
- 20:26 nuria: restarting browser-general oozie job
- 16:58 elukey: restart the hdfs datanode on an-worker1080 to pick up new Ipv6 settings
2019-07-24
- 22:53 nuria: uploading of refinery-0.0.95 to archiva failed, resetting archiva pw
- 21:17 nuria: deployment of refinery 0.0.95 aborted
- 16:40 ottomata: removed all non reportupdater-queries job repositories from /srv/reportupdater/jobs/ - T222739
- 07:55 elukey: restart pageview-hourly oozie coordinator to pick up new hive2 action settings
2019-07-23
- 09:23 elukey: restart projectview-hourly-coordinator with correct config - T228731
2019-07-22
- 17:33 nuria: finished deploying refinery (no refinery source deploy, just bumping up jars)
2019-07-18
- 18:34 ottomata: backfilling MobileWikiAppDailyStats data since June 7 to populate missing fields (e.g. appinstallid) in refined data. - T226219
- 14:28 nuria: deployed refinery v0.0.40
2019-07-17
- 18:10 nuria: starting build of new refinery-source 0.0.94
2019-07-15
- 16:46 elukey: add ipv6 aaaa/ptr records for an-worker* hosts (still didn't have them)
- 14:32 elukey: resize /srv on an-coord1001 to 103G (-10G) to allow lvm backups
- 14:31 elukey: restart hive/oozie/mariadb on an-coord1001 after maintenance
- 14:23 elukey: temporarily stop oozie/hive/mariadb for maintenance
- 13:46 elukey: stop all timers on an-coord1001 as prep step for maintenance
- 06:50 elukey: run msck repair table mediawiki_wikitext_history in beeline
2019-07-11
- 21:28 ottomata: rerunning eventlogging_to_druid_readingdepth_hourly
- 21:26 ottomata: rerunning eventlogging_to_druid_navigationtiming_hourly
- 21:18 ottomata: rerunning /usr/local/bin/eventlogging_to_druid_prefupdate_hourly
- 20:31 ottomata: resized /srv on an-coord1001 from 60G to 115G - T227132
- 16:07 elukey: sudo chown -R analytics:analytics /srv/geoip/archive/ on stat1007
- 15:47 elukey: chown -R analytics:analytics /wmf/data/archive/geoip on HDFS
2019-07-09
[edit]- 18:58 nuria: re-refining ExternalGuidance events for July 2019
- 14:47 ottomata: moved all mediawiki_page_* event tables to schema aware refine job
- 13:26 elukey: enable base::firewall on stat1007
2019-07-08
[edit]- 07:03 elukey: add base::firewall to stat1004
2019-07-05
[edit]- 08:16 elukey: forced manual run of refinery-druid-drop-public-snapshots.service on an-coord1001
2019-07-04
[edit]- 08:00 joal: Kill mediawiki-history-reduced coordinator and restart it with manually patched version
2019-07-03
[edit]- 21:57 nuria: deployed wikistats2 https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/520632/
2019-07-02
[edit]- 10:10 elukey: reset-failed refinery-sqoop-mediawiki-private.service
- 01:40 milimetric: deployed refinery, restarted refinery-mediawiki-sqoop-private
2019-07-01
[edit]- 20:29 ottomata: removed old refinery deploy caches from an-coord1001 to free up disk space
- 20:19 milimetric: syncing to hdfs on minor refinery deploy to remove hiwikisource from sqoop lists
- 10:26 elukey: removed Hive tables and Database from Superset - T223919
- 06:37 joal: Move newly computed snapshot for 2019-05 in place of original one for new checker run to normally succeed
2019-06-28
[edit]- 20:57 joal: Restart mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-reduced-coord
- 20:04 joal: drop-recreate mediawiki_history, mediawiki_page_history and mediawiki_user_history tables in hive
- 18:59 joal: Restart Webrequest bundle
- 18:53 joal: Kill data-quality-hourly bundle
- 18:52 joal: Kill webrequest bundle
- 18:33 joal: Deploy refinery with scap
- 18:15 joal: Deploy refinery to HDFS
- 17:43 elukey: deleted /srv/home/nathante/.local/share/Trash/* to free space on notebook1004
- 17:12 joal: Deploying refinery with scap
- 17:12 joal: Refinery-source v0.0.93 released to archiva
2019-06-26
[edit]- 07:23 joal: manually rerun webrequest-druid-hourly-wf-2019-6-26-5 (second failure, druid reboot this time)
- 07:04 joal: Manually rerun pageview-hourly-wf-2019-6-26-5, aqs-hourly-wf-2019-6-26-5 and webrequest-druid-hourly-wf-2019-6-26-5
- 06:22 elukey: stop camus and other timers on an-coord1001 (prep step for reboot)
2019-06-25
[edit]- 14:00 ottomata: killing mediawiki-load-bundle - T226436
2019-06-20
[edit]- 17:18 milimetric: deployed wikistats 2
2019-06-19
[edit]- 19:57 joal: Killing mediawiki-history-wikitext job because of failures due to userId parsing (same as previous month)
- 13:41 ottomata: renaming event.mediawiki_page_restrictions_change to event.mediawiki_page_restrictions_change_T226051 - T226051
2019-06-17
[edit]- 18:34 elukey: run hdfs fsck / on an-master1001
- 15:52 elukey: re-run webrequest-load-wf-upload-2019-6-17-14 and webrequest-load-wf-upload-2019-6-17-13, failed due to reboots
- 14:34 elukey: manual run of mediawiki-history-drop-snapshot.service to test new debug log
- 14:16 elukey: re-run webrequest-load-wf-text-2019-6-17-12 manually, failed due to reboots
- 13:45 elukey: re-run webrequest-load-wf-upload-2019-6-17-12, failed due to reboots
2019-06-16
[edit]- 09:53 elukey: hdfs dfs -chmod o-rw /wmf/data/raw/netflow
- 09:52 elukey: hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/raw/netflow
- 08:09 elukey: manually restart refinery-druid-drop-public-snapshots.service with new unit settings (-t druid1004.eqiad.wmnet vs -t druid1004.eqiad.wmnet:8081)
2019-06-14
[edit]- 13:22 joal: Restarting AQS using `scap deploy --service-restart`
2019-06-13
[edit]- 18:18 fdans: deployment complete
- 17:42 fdans: deploying refinery
- 17:40 fdans: updating refinery jar symlinks
- 17:20 fdans: Releasing new version of refinery source (v0.0.92)
2019-06-11
[edit]- 07:38 fdans: reset fail alert for refinery-import-page-history-dumps
2019-06-10
[edit]- 18:12 joal: Restart pageview, pageview-druid-hourly/daily/monthly oozie jobs for them to run in production queue
- 18:05 joal: Kill/Restart webrequest bundle and move it to production queue
- 17:54 ottomata: rolling restart of AQS service using scap deploy for new mediawiki_history_snapshot
2019-06-08
[edit]- 08:17 joal: Manually re-run patched refine_eventlogging_analytics on an-coord1001 with flags "--ignore_failure_flag=true --since 48"
- 08:12 elukey: remove org.wikimedia.analytics.refinery.job.refine.filter_out_non_wiki_hostname from refine's transform functions temporarily to unblock T225342
- 07:37 elukey: manual run of monitor_refine_eventlogging_analytics
- 07:28 joal: Manually run refine_eventlogging_analytics on an-coord1001 with flag --ignore_failure_flag=true
2019-06-07
[edit]- 17:42 joal: Drop currently unused /wmf/data/wmf/webrequest_subset folder
- 17:29 elukey: chown -R analytics:analytics-privatedata-users + chmod o-rw /wmf/data/wmf/netflow on HDFS
- 17:18 mforns: restarted turnilo to clear deleted datasource
- 17:17 elukey: restart turnilo to remove the old netflow datasource's settings
- 17:01 mforns: restarted turnilo to clear deleted datasource
- 16:18 joal: rerun webrequest-load-wf-text-2019-6-7-14 after failure
- 09:59 joal: Kill wikitext-history job to prevent more resource consumption because of failures
2019-06-06
[edit]- 09:52 elukey: chown report updater output dirs on stat1007 to analytics:wikidev (was hdfs:wikidev) to unblock creation of new data
- 09:45 elukey: re-run refine_sanitize_eventlogging_analytics_immediate with since = 900 in the .properties file
- 06:38 elukey: re-run refine_sanitize_eventlogging_analytics_immediate with since = 48 in the .properties file (manually added)
- 05:36 elukey: chown analytics:analytics /wmf/data/event_sanitized/{CentralNoticeTiming,LayoutJank,EventTiming,ElementTiming} (new directories created with yarn:analytics)
2019-06-05
[edit]- 20:59 mforns: finished deployment of analytics/refinery up to 0660e70153dec892ae20bee7119a72cc17e8ec87
- 20:20 mforns: starting deployment of analytics/refinery up to 0660e70153dec892ae20bee7119a72cc17e8ec87
- 18:20 mforns: finished deployment of analytics/refinery/source v0.0.91
- 18:00 mforns: starting deployment of analytics/refinery/source v0.0.91
- 10:07 elukey: attempt to re-run webrequest-load-wf-text-2019-6-4-20 via Hue (temporary errors in the logs)
2019-06-04
[edit]- 08:03 elukey: restart hive-server2 on an-coord1001 to pick up new GC/Heap settings
- 06:57 elukey: restart hive metastore on an-coord1001 to apply new GC/heap settings
2019-06-03
[edit]- 06:51 elukey: add the server field to the webrequest event format in varnishkafka + roll restart of all the varnishkafkas (via puppet) - T224236
2019-06-02
[edit]- 07:04 elukey: manually restart refinery-import-page-history-dumps.service with some debug info to check what file breaks
- 04:50 joal: Restart mediawiki-history-wikitext (dumps conversion) oozie job
- 04:12 joal: Restart load-cassandra oozie bundle to use analytics user
2019-06-01
[edit]- 08:03 elukey: manually restart refinery-sqoop-whole-mediawiki.service after failure
2019-05-27
[edit]- 19:42 elukey: chown analytics:analytics /wmf/data/event/mediawiki_job_userOptionsUpdate on HDFS
2019-05-22
[edit]- 21:29 joal: Manually refine webrequest_upload_2019_05_22_12 removing 19 rows having user-agents causing UAParser issue
- 20:44 joal: Manually refine webrequest_text_2019_05_22_12 removing 19 rows having user-agents causing UAParser issue
- 17:27 joal: Manually rerun webrequest-load-wf-upload-2019-5-22-12 with higher error-threshold as dataloss-error is confirmed false positive
2019-05-21
[edit]- 06:28 elukey: chown analytics:analytics /user/hdfs/salts/eventlogging_sanitization on HDFS
2019-05-20
[edit]- 17:17 elukey: chown -R analytics:analytics /tmp/DataFrameToDruid on HDFS
- 16:39 joal: Manually run webrequest-load-wf-upload-2019-5-20-11 with higher error threshold as errors were false positives
- 15:28 joal: Rerunning timeout webrequest-load-coord-text and webrequest-load-coord-upload (2019-05-20T09:00)
- 14:41 elukey: chown analytics:analytics /wmf/data/event_sanitized on HDFS
- 12:02 elukey: chown analytics:analytics /wmf/data/event on HDFS
- 12:00 elukey: chown analytics:analytics /wmf/data/wmf/event on HDFS
- 10:21 elukey: chown -R analytics:analytics /wmf/data/raw/ dirs (except the webrequest one that has different perms)
- 10:07 elukey: chown analytics:analytics /wmf/camus dirs (except the webrequest dir)
- 08:49 elukey: move report updater HDFS jobs to the analytics user
2019-05-18
[edit]- 11:25 elukey: delete analytics-store config from Superset
2019-05-17
[edit]- 07:46 elukey: restart mediawiki history and denormalize coordinators with the new analytics user (left mediawiki-history-wikitext-coord aside for further investigation)
- 07:22 elukey: chown -R analytics:analytics /wmf/data/wmf/mediawiki
2019-05-16
[edit]- 20:08 joal: Manually fixing banner job
- 19:53 joal: Restarting banner_activity-druid-monthly-coord after chuu chuu
- 16:43 elukey: chown analytics:analytics /wmf/camus/webrequest-00 on HDFS
- 16:36 elukey: restart the webrequest-load-bundle after the previous chown of the webrequest raw data
- 16:23 elukey: chown -R analytics /wmf/data/raw/webrequest - step missed earlier on in the migration
- 14:09 elukey: restart the webrequest-druid-hourly-coord coordinator with the analytics user
- 14:08 elukey: restart the webrequest-druid-daily-coord coordinator with the analytics user
- 13:57 elukey: start webrequest-load-bundle from hour 12:00
- 13:27 elukey: chown -R analytics:analytics /user/hive/warehouse/wmf_raw.db on HDFS
- 13:23 elukey: chown -R analytics:analytics /wmf/data/raw/webrequests_faulty_hosts on HDFS
- 13:08 elukey: chown -R analytics:analytics /wmf/data/raw/webrequests_data_loss on HDFS
- 12:57 elukey: chown -R analytics:analytics-privatedata-users /wmf/data/wmf/webrequest on HDFS
- 12:53 elukey: kill the webrequest-load-bundle in hue - prep step to migrate the webrequest bundle to the analytics user
- 12:49 elukey: kill webrequest-load-coord-upload from hue - prep step to migrate the webrequest bundle to the analytics user
2019-05-15
[edit]- 21:00 fdans: refinery deployed successfully
- 20:43 fdans: deploying refinery
- 20:31 fdans: updating symlinks for jars
- 20:11 fdans: deploying refinery source
- 18:02 fdans: rerunning refine for VirtualPageviewHourly @ 9am
- 10:34 elukey: superset upgraded to 0.32
2019-05-14
[edit]- 15:33 mforns: restart turnilo to clear deleted datasource
2019-05-12
[edit]- 15:33 elukey: rollback python-kafka on eventlog1002 to 1.4.1-1~stretch1
- 12:14 elukey: restart eventlogging on eventlog1002
2019-05-10
[edit]- 15:53 elukey: kill mediacounts-archive coordinator, chown analytics:analytics /wmf/data/archive/mediacounts + restart the coord with the analytics user
- 14:53 ottomata: restarted eventlogging with python-kafka-1.4.3
- 14:17 ottomata: downgrading python-kafka from 1.4.6-1~stretch1 to 1.4.3-2~wmf0 on eventlog1002 - T221848
- 06:30 elukey: refine with higher loss threshold webrequest upload 2019-5-8-18
2019-05-09
[edit]- 16:38 elukey: restart hive-server2 on an-coord1001 due to OOMs
- 16:13 elukey: killed application_1555511316215_77583 from Yarn CLI
- 11:04 elukey: kill oozie mediawiki geoeditors coords (3 in total) + chown -R analytics /wmf/data/wmf/mediawiki_private (raw data already chowned with analytics:analytics) + restart of the coords with the analytics user
- 09:17 elukey: restart oozie wikidata coordinators (3 in total) with the analytics user
2019-05-08
[edit]- 15:39 mforns_: deploying refinery up to 698f2137aa965b07548ae7565aafaa784628b13c together with refinery-source 0.0.89
- 14:55 elukey: kill last_access_uniques-daily-asiacell-coord from hue (coord not used anymore)
- 14:05 mforns_: deployed refinery-source up to ad74c41b05d5f838df6febb379e883855abb203d
- 13:09 mforns_: started deployment train
- 11:15 elukey: kill projectview coords (2 in total) + chown analytics:analytics /wmf/data/wmf/projectview and /wmf/data/archive/projectview + restart coords with the analytics user
- 09:50 elukey: kill virtualpageviews coords (3 in total) + chown analytics:analytics /wmf/data/wmf/virtualpageview + restart of the coords with user analytics
- 09:00 elukey: kill unique_devices coords (10 in total) + chown analytics:analytics /wmf/data/wmf/unique_devices + restart of 10 coords with user analytics
- 06:41 elukey: chown /tmp/mobile_apps to analytics:analytics
2019-05-07
[edit]- 14:37 elukey: kill pageview oozie coords (4 in total) + chown analytics:analytics /wmf/data/wmf/pageview /wmf/data/archive/pageview + restart of the coordinators with the analytics user
- 11:46 elukey: kill mobile apps coordinators + chown analytics:analytics /wmf/data/archive/mobile_apps, /wmf/data/wmf/mobile_apps + restart of the coordinators with user analytics
- 11:23 joal: Updating /wmf/data/raw/mediawiki_private/tables to be owned by analytics:analytics
- 11:19 joal: Updating /wmf/data/raw/mediawiki/xmldumps to be owned by analytics:analytics
- 11:19 joal: Updating /wmf/data/raw/mediawiki/project_namespace_map to be owned by analytics:analytics
- 11:19 joal: Updating /wmf/data/raw/mediawiki/tables to be owned by analytics:analytics
- 09:38 elukey: kill clickstream-coord, chown /wmf/data/archive/clickstream to analytics:analytics, restart the job with the analytics user override
- 09:24 elukey: kill ores-revision-scores-public-coord via hue (not used anymore)
- 07:46 elukey: temporary override of oozie/util/druid/load/workflow.xml in HDFS's refinery to allow the analytics user to push data to druid from oozie
2019-05-06
[edit]- 14:22 elukey: kill apis-coord and relaunch it with user analytics
- 13:51 elukey: kill mediacounts-load-coord, chown analytics:analytics /wmf/data/wmf/mediacounts, restart coordinator with user 'analytics'
- 12:52 elukey: kill interlanguage-coord, chown analytics:analytics /wmf/data/wmf/interlanguage, restart coordinator with user 'analytics'
- 12:15 elukey: kill browser-general-coord, chown analytics:analytics /wmf/data/wmf/browser, restart coordinator with user 'analytics'
- 09:32 joal: manually touching success files to start banner_activity-druid-monthly-coord between 2018-06-01/2018-12-31
- 09:28 joal: Launch new banner_activity-druid-monthly-coord between 2018-06-01/2018-12-31 to cover for timedout past actions
- 09:13 elukey: kill banner impression coordinators, chown /wmf/data/wmf/banner_impressions to analytics:analytics and start coordinators again
- 07:42 elukey: chown -R /wmf/data/wmf/aqs/* to analytics:analytics (was: analytics:hdfs)
2019-05-05
[edit]- 07:31 joal: Manually launch druid indexation of mediawiki_history_reduced_2019_04
- 07:29 joal: Manually add 2019-04 hive partition to mediawiki_history_reduced after automated job failed (expected failure after refactor)
2019-05-03
[edit]- 07:14 joal: Restarting mediawiki-history-check_denormalize-coord with missing parameter (patch provided to prevent the coord to start without it)
2019-05-02
[edit]- 13:30 joal: Restarting AQS oozie job with -Duser=analytics parameter
- 13:10 joal: Kill oozie aqs-hourly-coord
- 08:57 elukey: manual start of refinery-sqoop-mediawiki-production.service
2019-05-01
[edit]- 20:02 ottomata: sudo systemctl stop refinery-sqoop-mediawiki-production
- 19:58 ottomata: sudo systemctl disable refinery-sqoop-mediawiki-production
- 18:25 joal: restarted oozie jobs mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord and mediawiki-history-reduced-coord
- 18:22 joal: Confirming that sqoop-private (cu_changes) will run automatically tonight (2nd of the month at 00:00) - nothing needed
- 18:08 joal: Manually killed sqoop-production (comment and actor) to have it done after the current manual labs run
- 17:59 joal: Starting a manual run of sqoop
- 17:55 joal: deploying refinery onto HDFS
- 17:40 joal: deploy refinery using scap after failed attempt
- 17:37 joal: Recreate wmf.mediawiki_history (+page and user) and wmf.mediawiki_history_archive (with old data)
- 17:12 joal: Moving existing mediawiki-history to /wmf/data/wmf/mediawiki/archive folder
- 17:08 joal: Killing oozie jobs for new deploy: mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-reduced-coord
- 17:02 joal: Deploying refinery using scap
- 16:48 joal: Kill oozie mediawiki-history-druid-coord for true (replaced by edit_hourly job)
- 16:08 joal: refinery-source v0.0.88 released on archiva
2019-04-30
[edit]- 13:25 ottomata: restarting eventlogging processes to upgrade to python-kafka 1.4.6 - T221848
2019-04-29
[edit]- 08:22 joal: Deploying refinery using scap (analytics-deploy user test)
2019-04-25
[edit]- 14:19 mforns: Restarted Turnilo to clear deleted datasource
2019-04-24
[edit]- 15:00 elukey: set innodb_file_format=Barracuda and innodb_large_prefix=1 on mariadb on an-coord1001 to allow bigger indexes for Superset db upgrades
- 07:43 fdans: refinery uploaded to hdfs and webrequest bundle restarted
- 07:06 fdans: restarted webrequest bundle
- 06:24 elukey: kill of application_1555511316215_18282 on Hadoop due to excessive resource usage
2019-04-23
[edit]- 13:49 elukey: delete tbayer_popups from druid analytics - T220575
- 09:13 fdans: refinery deployed successfully
- 08:28 fdans: deploying refinery
- 08:26 fdans: refinery source v0.0.87 released and symlinks updated
- 07:04 fdans: releasing refinery source v0.0.86 for what I hope is the last time
2019-04-18
[edit]- 18:55 fdans: updated jars
- 18:53 fdans: Release of v0.0.86 in maven succeeded
- 15:22 fdans: restarting release of version 0.0.86 of refinery source to maven
- 14:29 fdans: releasing version 0.0.86 of refinery source to maven
2019-04-17
[edit]- 09:06 elukey: restart eventlogging on eventlog1002 due to errors in processors and consumer lag accumulated after the last Kafka Jumbo roll restart
2019-04-13
[edit]- 09:21 elukey: re-run failed webrequest-text 2018-04-13-07 job - temporary failure between Hive and HDFS
2019-04-12
[edit]- 10:12 elukey: matomo upgraded to 3.9.1 to fix some security vulns
2019-04-10
[edit]- 14:48 elukey: restart turnilo to pick up the new nodejs runtime
- 13:58 joal: Deploying AQS
2019-04-09
[edit]- 18:40 ottomata: chowning files in analytics.wm.org/datasets/archive/public-datasets/ as stats:wikidev
- 15:00 fdans: backfilling data between previous backfill end and start of puppetized job for PrefUpdate
- 13:53 mforns: restarted turnilo to clear deleted datasource
2019-04-08
[edit]- 14:50 fdans: backfilling prefupdate schema into druid from Jan 1 2019 until Apr 1 2019
2019-04-04
[edit]- 21:20 mforns: Restarted turnilo to clear deleted datasource
2019-04-03
[edit]- 19:16 elukey: failover from namenode on 1002 (currently active after the outage) to 1001 (standby)
- 18:07 joal: mediawiki-history-checker manual rerun successful
- 15:22 elukey: execute kafka preferred-replica-election on kafka-jumbo
2019-04-02
[edit]- 17:54 mforns: restarted turnilo to clear deleted datasource
- 17:29 milimetric: revision/pagelinks failed wikis rerun successfully, now forcing comment/actor rerun
- 15:02 mforns: Rerunning webrequest-load-coord for 2019-04-01T22
- 14:59 elukey: re-run of webrequest upload 2019-04-01-14 with higher data loss threshold
- 10:14 elukey: restart eventlogging's mysql consumers on eventlog1002 - T219842
- 06:18 joal: Deleted (in hdfs bin) actor and comment table data because it has been sqooped too early - manual rerun will be started once labs sqoop is done
2019-04-01
[edit]- 06:02 elukey: kill + re-run of pageviews hourly 30-03 hour 7 - seems stuck in heart beat after reduce completed
2019-03-29
[edit]- 12:29 mforns: Restarted Turnilo to refresh deleted test datasource
- 12:11 mforns: Restarted Turnilo to refresh deleted test datasource
- 11:52 mforns: Restarted Turnilo to refresh deleted test datasource
- 11:10 mforns: Restarted Turnilo to refresh deleted test datasource
2019-03-28
[edit]- 19:04 joal: Manually rerun webrequest-load-wf-upload-2019-3-28-8 with higher error threshold (a lot of false positives!)
2019-03-27
[edit]- 21:13 milimetric: done deploying refinery, will now restart monthly geoeditors coordinator
2019-03-18
[edit]- 11:08 elukey: restart hue on analytics-tool1001 to pick up some new changes (should be a no-op)
2019-03-14
[edit]- 17:43 mforns: Deploying AQS using scap (node10 upgrade)
2019-03-13
[edit]- 22:58 nuria: mediawiki-check denormalized restarted 0147256-181112144035577-oozie-oozi-C
- 22:48 nuria: killed oozie job 0131427-181112144035577-oozie-oozi-C to correct e-mail address
2019-03-12
[edit]- 16:06 joal: Rerun webrequest-load-wf-text-2019-3-12-11 after error
2019-03-08
[edit]- 20:48 joal: Rerun webrequest-load-wf-upload-2019-3-8-19 after hive outage
- 14:52 joal: deployed wikistats2 2.5.5
2019-03-07
[edit]- 14:50 joal: Restart mediawiki-history after having corrected data
- 13:52 joal: manually killing mediawiki-history-denormalize-wf-2019-02 instead of letting it fail another 3 attempts
- 10:40 joal: Manually fixed sqoop issues
2019-03-06
[edit]- 18:13 joal: Refinery deployed onto hadoop
- 18:08 joal: Refinery deployed using scap
2019-03-04
[edit]- 16:17 elukey: disable all report updater jobs via puppet (ensure => absent) due to dbstore1002 decom
2019-02-28
[edit]- 17:16 milimetric: restarted mediawiki/history/load job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0131840-181112144035577-oozie-oozi-C/
- 14:40 milimetric: refinery deployed with new sqoop logic and updated history/load job
- 09:57 fdans: restarting mediawiki-history-wikitext coordinator
- 09:56 fdans: restarting mediawiki-history-check_denormalize
- 09:48 fdans: restarting mediawiki-history-denormalize coordinator
2019-02-27
[edit]- 17:42 elukey: re-run webrequest-load-wf-upload-2019-2-27-16 (failed due to a shutdown of analytics1071 for hw maintenance)
2019-02-24
[edit]- 10:24 elukey: restart check webrequest service on an-coord1001 (failed due to /mnt/hdfs being unavail)
2019-02-20
[edit]- 18:17 fdans: deploying refinery
- 16:03 ottomata: removing spark 1 from Analytics cluster - T212134
2019-02-19
[edit]- 09:47 mforns: deployed refinery (without refinery-source) until commit 0d7ec1989852d4dd5b1497463fd9509e4f5bdb87
2019-02-15
[edit]- 18:18 nuria: restarted turnilo in analytics-tool1002
2019-02-14
[edit]- 09:07 joal: rerun mediawiki-history-wikitext-wf-2019-01
- 09:06 joal: Re-run webrequest-load-wf-text-2019-2-14-6
2019-02-13
[edit]- 19:46 mforns: Deploying refinery with scap
- 19:12 mforns: Deployed refinery-source v0.0.85 using jenkins
2019-02-12
[edit]- 09:39 elukey: systemctl disable/stop mediawiki-geoeditors-drop-month.timer on an-coord1001
2019-02-11
[edit]- 10:01 elukey: restart superset to pick up new config.py changes
- 08:38 elukey: restart superset to pick up new settings in config.py
2019-02-10
[edit]- 10:52 elukey: re-run webrequest upload webrequest-load-wf-upload-2019-2-10-0
- 10:52 elukey: killed oozie job related to webrequest-load-wf-upload-2019-2-10-0, seemed stuck in generate_sequence_statistics (not really clear why)
2019-02-08
[edit]- 13:45 joal: wikistats2 snapshot updated to 2019-01
2019-02-06
[edit]- 19:28 milimetric: deployed refinery
- 18:41 joal: Killing-restarting mediawiki-history related oozie jobs
2019-02-04
[edit]- 20:30 joal: Confirm that last week dataloss warnings were false alarms (upload -> 2019-1-28-15, 2019-1-28-16, 2019-2-1-1, 2019-2-1-4, 2019-2-1-13 -- text -> 2019-2-1-13, 2019-2-1-15)
- 14:47 joal: Rerun webrequest-load-coord-text for 2019-02-04T04:00:00
2019-01-24
[edit]- 11:49 mforns: Restarted Turnilo to remove deleted datasource
2019-01-23
[edit]- 15:24 elukey: added lea-wmde and goransm to Superset
2019-01-22
[edit]- 20:30 milimetric: updated hive tables in wmf_raw for actor/comment refactor
- 19:00 milimetric: deployed refinery with refinery-source
- 15:15 mforns: Restarted turnilo to clear deleted datasource
- 08:59 elukey: clean up reportupdater_discovery-stats-interactive from stat1006 - old job not cleaned up
2019-01-21
[edit]- 09:34 elukey: removed ./jobs/limn-language-data/interlanguage/.reportupdater.pid in /srv/reportupdater on stat1007 to force the first run of the timer
2019-01-17
[edit]- 13:57 elukey: re-run pageview-hourly-wf-2019-1-12-14's coordinator
2019-01-15
[edit]- 14:56 fdans: "rolling back to stable superset"
- 14:40 fdans: deploying superset 0.26.3-wikimedia1
- 14:36 elukey: stop superset to allow a clean mysqldump
2019-01-14
[edit]- 17:48 nuria: restarting tUrnilo to pick up new config.. sigh
- 17:47 nuria: restarting tornilo to pick up new config
- 16:55 elukey: restart turnilo to pick up new changes
- 16:40 ottomata: running refine eventlogging analytics for dec 17 2018 12:00 - 16:00 - T213602
- 15:26 elukey: reimage stat1005 - T205846
2019-01-09
[edit]- 14:13 elukey: shutdown all the hdfs datanode daemons on the decom nodes (analytics1028->41)
2019-01-08
[edit]- 08:09 elukey: manual stop of hdfs balancer to ease the under replicated blocks healing (worker nodes already decently balanced)
- 07:24 elukey: decommission analytics10[39-41] from Analytics Hadoop
2019-01-07
[edit]- 22:02 mforns: Finished restarting oozie jobs after refinery deployment
- 21:24 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs, proceeding to restart oozie jobs
- 21:05 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
- 19:48 ottomata: merging change to make rsync server modules pull only - T205157 , T205152
- 19:46 mforns: Deployed refinery-source using jenkins
- 17:21 joal: Manually repair hive table and add _PARTITIONED flag to project_namespace_map
- 17:03 elukey: re-enabled eventlogging mysql consumers
- 16:02 elukey: stop eventlogging mysql consumers on eventlog1002 - db1107 down
- 15:36 mforns: Restarted turnilo to clear deleted test datasource
- 11:26 elukey: move hue/oozie/hive password handling from auto-load to role lookup in the puppet private repo
- 09:19 joal: Deploying refinery onto HDFS so that refinery-job-0.0.82.jar is present on HDFS (needed to run mediawiki-history successfully)
- 08:49 joal: Rerun failed mediawiki-denormalize job with updated spark conf
- 07:29 elukey: decom analytics103[7,8] from Analytics Hadoop
2019-01-06
[edit]- 07:58 elukey: manually stopped the hdfs-balancer to ease the decom process (the hdfs nodes are already nicely balanced)
- 07:55 elukey: decom analytics103[5,6] from Analytics Hadoop
2019-01-05
[edit]- 07:37 elukey: decommission analytics1033/34 from the Hadoop cluster
2019-01-04
[edit]- 09:42 joal: Kill banner test kafka-druid ingestion job
- 08:16 elukey: restart eventlogging daemons on eventlog1002 to pick up openssl updates
- 07:39 elukey: decommission analytics1031/32 from the Hadoop analytics cluster
- 07:37 elukey: manually stopped hdfs-balancer (cluster already balanced, only one host left with some blocks to get) to ease the decom of two more nodes
2019-01-03
[edit]- 11:26 elukey: manually started the hdfs-balancer (failed earlier on due to the presence of a lock file)
2019-01-02
[edit]- 18:03 elukey: decom analytics10(29|30) from HDFS/Yarn
- 10:31 elukey: killed all hdfs-balancer processes (one running since ages ago in 2018)
- 09:16 elukey: decom analytics1028 from hdfs/yarn