Analytics/Server Admin Log/Archive/2021
2021-12-22
- 19:13 milimetric: Additional context on the last delete message: this is on an-launcher1002, whose disk has filled up
- 19:12 milimetric: Marcel and I are deleting files from /tmp older than 60 days
- 15:55 mforns: finished refinery deployment for anomaly detection queries
- 14:54 mforns: starting refinery deployment for anomaly detection queries
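The 19:12 /tmp cleanup above doesn't record the exact command used; a minimal sketch of an age-based sweep, assuming a GNU find-style `-mtime +60` filter (run here against a scratch directory rather than /tmp so it is safe to try):

```shell
# Hedged sketch of a "delete files older than 60 days" sweep; the exact
# command run on an-launcher1002 was not logged. A scratch directory
# stands in for /tmp here.
scratch=$(mktemp -d)
touch "$scratch/recent.log"
touch -d '90 days ago' "$scratch/stale.log"   # GNU touch: backdate mtime
# -mtime +60 matches files last modified more than 60 days ago
find "$scratch" -type f -mtime +60 -delete
ls "$scratch"   # only recent.log remains
rm -r "$scratch"
```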
2021-12-20
- 18:59 mforns: finished deployment of refinery, adding anomaly detection hql for airflow job
- 18:39 mforns: started to deploy refinery, adding anomaly detection hql for airflow job
2021-12-17
- 12:32 btullis: Upgraded druid packages, with pool/depool on druid1004
- 11:20 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
- 11:18 btullis: updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files
2021-12-16
- 11:01 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
- 11:01 btullis: upgrading druid on the test cluster with new packages to test log4j changes.
2021-12-15
- 08:51 joal: Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart
- 07:20 elukey: elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics
2021-12-14
- 19:02 milimetric: finished deploying the weekly train as per etherpad
- 18:04 joal: Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-12-13 after cluster reboot
- 17:51 btullis: rebooting aqs1015
- 17:25 btullis: rebooting aqs1013
- 17:19 btullis: rebooting aqs1012
- 16:00 btullis: rebooting aqs1011
- 15:53 btullis: rebooting aqs1010
- 15:00 btullis: btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
- 14:59 btullis: cassandra@cqlsh> ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; on aqs1010-a
- 14:25 btullis: btullis@aqs1011:$ sudo systemctl start cassandra-b.service
- 12:44 joal: Rerun failed cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2021-12-14-10
- 12:42 joal: Kill late spark cassandra loading job
2021-12-11
- 10:06 elukey: kill process 2560 on stat1005 to allow puppet to clean up the related user (offboarded)
- 10:04 elukey: kill process 2831 on stat1008 to allow puppet to clean up the related user (offboarded)
2021-12-09
- 11:08 btullis: roll restarting druid historical daemons on analytics cluster T297148
- 10:46 btullis: roll restarting druid brokers on analytics cluster
2021-12-07
- 20:09 ottomata: deploy wikistats2 with doc updates
2021-12-03
- 17:36 razzi: restart aqs-next to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cumin A:aqs-next 'systemctl restart aqs'`
- 17:36 razzi: restart aqs to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cookbook sre.aqs.roll-restart aqs`
- 07:33 elukey: move kafka-test to fixed uid/gid
2021-12-02
- 20:05 ottomata: restarting pageview-druid-daily-coord (killing 0062888-210701181527401-oozie-oozi-C) - I can't seem to rerun a particular hour, so just starting again from that hour.
- 17:57 elukey: drop "EventLogging MySQL" datasource from Superset (not valid anymore)
- 17:26 joal: Kill paragon job to prevent more nodemanagers from OOMing
2021-12-01
- 20:40 razzi: deploy refinery for T296089 patch https://gerrit.wikimedia.org/r/c/analytics/refinery/+/742672
2021-11-27
- 09:56 elukey: powercycle analytics1071, soft lockup stacktraces in the tty
2021-11-24
- 17:30 mforns: Deployed refinery using scap, then deployed onto hdfs
- 12:31 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
- 07:10 elukey: drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 on stat1006 to free space on the root partition
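Entries like the 07:10 one above free a nearly full root partition by deleting one large directory; a sketch of how such space hogs are typically located first with du, ranking the largest entries (the paths are illustrative; a scratch tree stands in for /tmp):

```shell
# Illustrative: rank the largest entries under a directory by disk usage,
# the usual step before deleting something like /tmp/blockmgr-* to free
# a full root partition. Paths here are a made-up scratch tree.
scratch=$(mktemp -d)
mkdir "$scratch/big" "$scratch/small"
head -c 200000 /dev/urandom > "$scratch/big/blob"
# -x stays on one filesystem; sort -rn puts the largest entry first
du -xs "$scratch"/* | sort -rn | head -1
rm -r "$scratch"
```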
2021-11-23
- 11:56 btullis: roll-restarting the cassandra services on the aqs cluster. (Not the aqs_next cluster)
- 11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart presto-server.service
- 11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart oozie.service
2021-11-22
- 12:18 btullis: failed back the hive services to an-coord1001 via CNAME change
- 11:36 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
- 10:44 btullis: deploying DNS change to switch hive to the standby server.
- 10:18 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore
2021-11-18
- 17:26 elukey: varnishkafka-webrequest on cp3050 is running with /etc/ssl/localcerts/wmf_trusted_root_CAs.pem
- 10:03 elukey: restart prometheus-druid-exporter on Druid Analytics to clear unnecessary metrics
- 07:32 elukey: restart prometheus-druid-exporter on Druid Public to see metrics difference
2021-11-17
- 16:01 btullis: roll-restarting kafka-test brokers
- 12:12 btullis: roll-restarting the presto analytics workers
- 11:44 btullis: btullis@archiva1002:~$ sudo systemctl restart archiva.service
- 07:29 elukey: `apt-get clean` on an-tool1005 to free space in the root partition
- 07:28 elukey: `sudo pkill -U jmixter` on stat100[5,8] to allow puppet to run and remove the offboarded user
2021-11-16
- 19:40 joal: Deploying refinery to HDFS
- 19:15 joal: Deploying refinery with scap
- 18:23 joal: Releasing refinery-source v0.1.21
- 11:32 btullis: btullis@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers public
- 10:20 btullis: roll-restarting hadoop masters
2021-11-15
- 16:37 joal: Rerun failed mediawiki-wikitext-history-wf-2021-10
2021-11-11
- 06:56 elukey: `systemctl start prometheus-mysqld-exporter@analytics_meta` on db1108
2021-11-10
- 18:20 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
- 10:19 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
2021-11-09
- 16:52 razzi: restart presto server on an-coord1001 to apply change for T292087
- 16:30 razzi: set superset presto version to 0.246 in ui
- 16:30 razzi: set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
- 12:23 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
- 07:23 elukey: `apt-get clean` on stat1006 to free some space (root partition full)
2021-11-08
- 19:51 ottomata: an-coord1002: drop user 'admin'@'localhost'; start slave; to fix broken replication - T284150
- 19:44 razzi: create admin user on an-coord1001 for T284150
- 18:07 razzi: run `create user 'admin'@'localhost' identified by <password>; grant all privileges on *.* to admin;` to allow milimetric to access mysql on an-coord1002 for T284150
2021-11-04
- 16:39 razzi: add "can sql json on superset" permission to Alpha role on superset.wikimedia.org
- 16:14 razzi: drop and restore superset_staging database to test permissions as they are in production
2021-11-03
- 17:07 razzi: razzi@an-tool1010:~$ sudo systemctl stop superset
- 16:57 razzi: dump mysql in preparation for superset upgrade
- 02:23 milimetric: deployed refinery with regular train
2021-10-29
- 23:04 btullis: deleted all remaining old cassandra snapshots on aqs100x servers.
- 22:58 btullis: deleted old snapshots from aqs1006 and aqs1009
- 17:45 razzi: set presto_analytics_hive extra parameter engine_params.connect_args.session_props.query_max_run_time to 55s on superset.wikimedia.org
- 10:39 elukey: roll restart of kafka-test to pick up new truststore (root PKI added)
2021-10-28
- 19:13 ottomata: re-enable hdfs-cleaner for /wmf/gobblin
2021-10-26
- 09:01 btullis: reverted hive services back to an-coord1001.
2021-10-25
- 16:03 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
- 13:02 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore
- 12:51 btullis: btullis@aqs1007:~$ sudo nodetool-a clearsnapshot
2021-10-21
- 14:05 ottomata: rerun refine_eventlogging_analytics refine_eventlogging_legacy and refine_event with -ignore-done-flag=true --since=2021-10-21T01:00:00 --until=2021-10-21T04:00:00 for backfill of missing data after gobblin problems
- 13:39 btullis: btullis@an-launcher1002:~$ sudo systemctl restart gobblin-event_default
- 10:35 joal: Re-refine netflow data after gobblin pulled data fix
- 08:41 joal: Rerun webrequest-load jobs for hour 2021-10-21T02:00
2021-10-20
- 18:11 razzi: Deployed refinery using scap, then deployed onto hdfs
- 16:36 razzi: deploy refinery change for https://phabricator.wikimedia.org/T287084
- 07:15 joal: rerun webrequest-load-wf-upload-2021-10-20-1 after node issue
- 06:27 elukey: reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage
2021-10-19
- 07:14 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_top_files-2021-10-17
2021-10-18
- 19:29 joal: Rerun cassandra-daily-wf-local_group_default_T_top_pageviews-2021-10-17
- 18:36 joal: Rerun cassandra-daily-wf-local_group_default_T_unique_devices-2021-10-17
- 16:22 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-10-17
- 16:16 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_referer-2021-10-17
- 15:17 joal: Rerun failed instances from cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2
- 14:49 elukey: restart hadoop-yarn-nodemanager on an-worker1119 and an-worker1103 (Java OOM in the logs)
- 12:09 btullis: root@aqs1013:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
- 12:09 btullis: root@aqs1012:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
- 09:25 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1013-b/
- 09:17 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1012-b/
- 09:16 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/cassandra_migration/aqs1012-b/
2021-10-15
- 08:33 btullis: btullis@aqs1007:~$ sudo nodetool-b clearsnapshot
2021-10-13
- 19:49 mforns: re-ran cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat for 2021-10-12 successfully
- 17:58 ottomata: deleting files on stat1008 in /tmp older than 10 days and larger than 20M sudo find /tmp -mtime +10 -size +20M | xargs sudo rm -rfv
- 17:54 ottomata: removed /tmp/spark-* files belonging to aikochou on stat1008
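The find-into-xargs pipeline at 17:58 splits on whitespace, so it can misfire on paths containing spaces or newlines; a sketch of the NUL-delimited variant (the `-mtime`/`-size` filters from the entry are dropped here for brevity, and a scratch directory stands in for /tmp):

```shell
# find | xargs splits on whitespace and can mangle odd filenames; pairing
# find -print0 with xargs -0 delimits paths with NUL bytes instead.
# (The -mtime +10 -size +20M filters from the log entry are omitted.)
scratch=$(mktemp -d)
touch "$scratch/a file with spaces"
find "$scratch" -type f -print0 | xargs -0 rm -v
rm -r "$scratch"
```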
2021-10-12
- 15:43 btullis: btullis@aqs1008:~$ sudo nodetool-b clearsnapshot
- 13:17 btullis: btullis@analytics1069:~$ sudo shutdown -h now
- 13:15 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-hdfs-*
- 13:14 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-yarn-nodemanager.service
- 07:26 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-10-11
2021-10-11
- 07:37 joal: rerun refine_event for `event`.`mediawiki_content_translation_event` year=2021/month=10/day=10/hour=16
2021-10-10
- 18:07 joal: Rerun webrequest-load-wf-text-2021-10-10-10 - failed due to network issue
2021-10-06
- 14:30 elukey: upgrade stat1005 to ROCm 4.2.0
- 13:20 btullis: btullis@aqs1004:~$ sudo nodetool-a clearsnapshot
- 10:20 elukey: upgrade ROCm to 4.2 on stat1008
2021-10-05
- 11:28 elukey: failover analytics-hive back to an-coord1001 after maintenance
2021-10-04
- 16:56 elukey: restart java daemons on an-coord1001 (standby)
- 13:43 elukey: failover analytics-hive to an-coord1002 (to restart java daemons on 1001)
- 07:43 joal: Kill-restart mediawiki-history-reduced job after deploy (more resources)
- 07:32 joal: Deploy refinery to hdfs
- 07:10 joal: Deploy refinery for mediawiki-history-reduced hotfix
- 06:56 joal: Kill-restart pageview-monthly_dump-coord to apply fix for SLA
2021-10-01
- 15:11 btullis: sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='editoractivation' --since='2021-09-29T22:00:00.000Z' --until='2021-09-30T23:00:00.000Z'
2021-09-30
- 19:55 ottomata: not changing stats uid to 499; it already exists as another system user
- 19:54 ottomata: changing stats uid and gid on an-launcher1002 and stat1005 to 499
- 09:32 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_netflow --ignore_failure_flag=true --since=2021-09-28T11:00:00 --until 2021-09-28T12:00:00
2021-09-29
- 09:16 elukey: restart hive-* units on an-coord1002 for openjdk upgrades (standby node)
2021-09-28
- 13:14 btullis: Deployed refinery using scap, then deployed onto hdfs
- 12:34 btullis: deploying refinery
- 09:55 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100*.eqiad.wmnet' 'nodetool-a snapshot -t T291472 local_group_default_T_pageviews_per_article_flat' 'nodetool-b snapshot -t T291472 local_group_default_T_pageviews_per_article_flat'
- 09:36 elukey: restart java daemons on an-test-coord1001 to pick up new openjdk
2021-09-27
- 11:18 btullis: btullis@stat1005:~$ sudo apt purge usrmerge
- 11:11 btullis: btullis@stat1005:~$ sudo apt install usrmerge
2021-09-24
- 22:33 razzi: restart an-test-coord presto coordinator service to experiment with web-ui.authentication.type=fixed
- 15:06 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a snapshot -t T291469' 'nodetool-b snapshot -t T291469'
- 14:47 btullis: btullis@aqs1007:~$ sudo nodetool-a repair --full local_group_default_T_mediarequest_per_file data
- 11:02 btullis: btullis@an-master1001:~$ sudo systemctl restart hadoop-mapreduce-historyserver
- 10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-namenode
- 10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-zkfc
- 10:35 btullis: btullis@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
- 10:07 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='centralnoticeimpression' --since='2021-09-23T04:00:00.000Z' --until='2021-09-24T05:00:00.000Z'
2021-09-22
- 17:23 razzi: razzi@an-test-coord1001:/etc/presto$ sudo systemctl restart presto-server
- 17:05 joal: Kill-restart oozie jobs after deploy (mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-dumps-coord, mediawiki-history-reduced-coord)
- 11:54 joal: release refinery-source v0.1.18 to archiva with Jenkins
2021-09-20
- 08:12 elukey: remove old /reportcard (password protected, old files from 2012) httpd settings for stats.wikimedia.org
2021-09-18
- 06:48 joal: Rerun webrequest-load-wf-text-2021-9-18-0 for errors after last night's production issue
2021-09-17
- 16:03 btullis: Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
- 15:15 btullis: btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)
- 10:18 btullis: btullis@an-web1001:~$ sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;
- 09:47 milimetric: deployed refinery to sync sanitize allowlist, deleting event_sanitized data per decision in the task
- 08:21 elukey: disable mod_cgi/mod_cgid on an-web1001 (and remove cgi-perl related httpd configs/settings)
2021-09-16
- 19:25 ottomata: pointing analytics-web cname at new an-web1001, this moves stats and analytics .wm.org from thorium to an-web1001 - T285355
- 18:30 joal: Create HDFS home folder for user 'analytics-research'
- 07:03 elukey: stop jupyter-kaywong-singleuser.service on stat1005 to allow puppet to clean up
2021-09-15
- 16:26 joal: Deploying refinery
2021-09-13
- 18:25 razzi: (I stopped replication earlier but forgot to !log)
- 18:24 razzi: razzi@dbstore1007:~$ for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done - reenable replication for T290841
- 18:19 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s4.service for T290841
- 18:13 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
- 18:05 razzi: sudo systemctl restart mariadb@s2.service
2021-09-07
- 11:41 joal: Restarting cassandra hourly loading job after C2 snapshot taken and C3 tables truncated
- 11:37 joal: Re-Add test rows in cassandra3 cluster after tables got truncated
- 10:25 hnowlan: truncating data tables on aqs_next cluster
- 10:12 joal: Kill cassandra-hourly loading job for cluster-migration first step
2021-09-03
- 11:43 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs (second)
- 09:57 joal: Deploy AQS on new AQS servers
- 09:45 joal: Kill-restart mediarequest-top cassandra loading jobs after deploy
- 09:12 joal: Rerun mediawiki-history-denormalize-wf-2021-08 after failure
- 09:07 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs
2021-09-01
- 16:44 mforns: finished one-off deployment of refinery to fix cassandra3 loading
- 15:57 joal: Kill cassandra loading jobs and restart them after deploy
- 15:55 mforns: starting one-off deployment of refinery to fix cassandra3 loading
- 13:15 joal: Restart cassandra jobs to load cassandra3 with spark
- 08:21 joal: Rerun webrequest-load-wf-upload-2021-9-1-0
2021-08-31
- 23:25 mforns: finished deployment of refinery (regular weekly train v0.1.17) successfully, only an-test-coord1001.eqiad.wmnet failed
- 22:41 mforns: starting deployment of refinery (regular weekly train v0.1.17)
- 22:27 mforns: Deployed refinery-source using jenkins
- 10:30 hnowlan: sudo cookbook sre.aqs.roll-restart aqs-next
2021-08-30
- 06:53 elukey: drop an-airflow1001's old airflow logs to free up the almost-full root partition
2021-08-26
- 06:22 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type d -empty -delete
- 06:21 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type f -delete -mtime +60
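A note on the 06:21 command above: find evaluates its expression left to right per file, so placing `-delete` before `-mtime +60` means the delete action fires before the age test is ever checked, removing every regular file rather than only old ones. A sketch demonstrating that ordering against a scratch directory:

```shell
# find evaluates its expression left to right, so -delete must come AFTER
# the filtering tests. With "-type f -delete -mtime +60", -delete fires
# before -mtime is checked, so BOTH files below are removed; the age-
# filtered form is "find ... -type f -mtime +60 -delete".
scratch=$(mktemp -d)
touch "$scratch/new.txt"
touch -d '90 days ago' "$scratch/old.txt"
find "$scratch" -type f -delete -mtime +60
ls -A "$scratch"   # empty: the age filter never took effect
rm -r "$scratch"
```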
2021-08-25
- 13:40 joal: Kill-restart pageview-monthly_dump job and 2 backfilling jobs
- 13:34 joal: Deploy refinery onto HDFS
- 13:09 joal: Deploying refinery using scap
2021-08-24
- 10:30 btullis: btullis@an-launcher1002:~$ sudo systemctl start hdfs-balancer.service
2021-08-20
- 08:46 btullis: btullis@druid1001:~$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
2021-08-19
- 19:05 razzi: razzi@deploy1002:/srv/deployment/analytics/aqs/deploy$ scap deploy "Deploy aqs 9c062f2"
- 19:02 razzi: note that the aqs-deploy repo's commit message DOES NOT include the changes of aqs in its changes list (though it has the correct SHA in the first line)
- 18:26 razzi: Beginning aqs deploy process
- 17:55 razzi: razzi@labstore1007:~$ sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service
- 17:53 razzi: sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service on labstore1006
2021-08-18
- 17:37 btullis: on an-coord1001: MariaDB [superset_production]> update clusters set broker_host='an-druid1001.eqiad.wmnet' where cluster_name='analytics-eqiad';
- 15:08 joal: Restart oozie jobs loading druid to use new druid-host
- 08:55 joal: Deploying refinery with scap
2021-08-13
- 16:46 elukey: cleanup /srv/discovery on stat1007 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/712422
- 15:16 milimetric: reran the other three failed jobs successfully
- 14:52 milimetric: rerunning webrequest-druid-hourly-wf-2021-8-13-13 because of failure to connect to Hive metastore
2021-08-12
- 14:46 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl disable druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
- 14:45 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
2021-08-11
- 19:43 btullis: btullis@druid1003:~$ sudo systemctl stop druid-overlord && sudo systemctl disable druid-overlord
- 19:41 btullis: btullis@druid1003:~$ sudo systemctl stop druid-historical && sudo systemctl disable druid-historical
- 19:40 btullis: btullis@druid1003:~$ sudo systemctl stop druid-coordinator && sudo systemctl disable druid-coordinator
- 19:37 btullis: btullis@druid1003:~$ sudo systemctl stop druid-broker && sudo systemctl disable druid-broker
- 19:30 btullis: btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
- 12:13 btullis: migration of zookeeper from druid1002 to an-druid1002 complete, with quorum and two synced followers. Re-enabling puppet on all druid nodes.
- 09:48 btullis: suspended the following oozie jobs in hue: webrequest-druid-hourly-coord, pageview-druid-hourly-coord, edit-hourly-druid-coord
- 09:45 btullis: btullis@an-launcher1002:~$ sudo systemctl disable eventlogging_to_druid_editattemptstep_hourly.timer eventlogging_to_druid_navigationtiming_hourly.timer eventlogging_to_druid_netflow_hourly.timer eventlogging_to_druid_prefupdate_hourly.timer
- 09:21 elukey: run "sudo find /var/log/airflow -type f -mtime +15 -delete" on an-airflow1001 to free space (root partition almost full)
2021-08-10
- 17:27 razzi: resume the following schedules in hue: edit-hourly-druid-coord, pageview-druid-hourly-coord, webrequest-druid-hourly-coord
- 17:10 razzi: sudo cookbook sre.druid.roll-restart-workers analytics (errored out)
- 09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service
- 09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_netflow_daily.service
2021-08-09
- 10:45 btullis_: btullis@an-druid1003:/var/log/druid$ sudo chown -R druid:druid /srv/druid /var/log/druid
- 10:25 btullis_: btullis@an-druid1003:~$ sudo puppet agent -tv
2021-08-04
- 09:12 btullis: btullis@an-coord1001:~$ sudo systemctl start hive-metastore.service hive-server2.service
- 09:12 btullis: btullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service
- 09:00 btullis: sudo systemctl start hive-metastore && sudo systemctl start hive-server2
- 09:00 btullis: btullis@an-coord1002:~$ sudo systemctl stop hive-server2 && sudo systemctl stop hive-metastore
2021-08-03
- 19:23 ottomata: bump Refine to refinery version 0.1.16 to pick up normalized_host transform - now all event tables will have a new normalized_host field - T251320
- 19:02 ottomata: Deployed refinery using scap, then deployed onto hdfs
- 14:57 ottomata: rerunning webrequest refine for upload 08-03T01:00 - 0042643-210701181527401-oozie-oozi-W
2021-08-02
- 18:49 razzi: sudo cookbook sre.druid.roll-restart-workers analytics
- 17:57 razzi: sudo cookbook sre.druid.roll-restart-workers public
2021-07-30
- 22:22 razzi: razzi@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers test
2021-07-29
- 18:12 razzi: sudo cookbook sre.aqs.roll-restart aqs
2021-07-28
- 10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl start hive-metastore.service hive-server2.service
- 10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl stop hive-server2.service hive-metastore.service
2021-07-26
- 20:54 razzi: reran the failed workflow of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-7-25
2021-07-22
- 18:38 ottomata: deploy refinery to an-launcher1002 for bin/gobblin job lock change
2021-07-20
- 20:30 joal: rerun webrequest timed-out instances
- 18:58 mforns: starting refinery deployment
- 18:40 razzi: razzi@an-launcher1002:~$ sudo puppet agent --enable
- 18:39 razzi: razzi@an-master1001:/var/log/hadoop-hdfs$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 18:37 razzi: razzi@an-master1002:~$ sudo -i puppet agent --enable
- 18:34 razzi: razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 18:32 razzi: razzi@an-master1002:~$ sudo systemctl start hadoop-yarn-resourcemanager.service
- 18:31 razzi: razzi@an-master1002:~$ sudo systemctl stop hadoop-yarn-resourcemanager.service
- 18:22 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
- 18:21 razzi: re-enable yarn queues by merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732
- 17:27 razzi: razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
- 17:17 razzi: stop all hadoop processes on an-master1001
- 16:52 razzi: starting hadoop processes on an-master1001 since they didn't failover cleanly
- 16:31 razzi: sudo bash gid_script.bash on an-master1001
- 16:29 razzi: razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade"
- 16:25 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver
- 16:25 razzi: sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again
- 16:25 razzi: sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again
- 16:23 razzi: sudo systemctl stop hadoop-hdfs-namenode on an-master1001
- 16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc
- 16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager
- 16:18 razzi: sudo systemctl stop hadoop-hdfs-namenode
- 16:10 razzi: razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
- 16:03 razzi: root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
- 15:57 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
- 15:52 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
- 15:37 razzi: kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done
- 15:08 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 14:52 razzi: sudo systemctl stop 'gobblin-*.timer'
- 14:51 razzi: sudo systemctl stop analytics-reportupdater-logs-rsync.timer
- 14:47 razzi: Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372)
- 14:46 razzi: razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
- 08:32 mforns: restarted webrequest bundle (messed up a coord when trying to rerun some failed hours)
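The application-kill loop at 15:37 above relies on `awk 'NR > 2'` to skip the two header lines that `yarn application -list` prints before the table rows. A sketch with a mock standing in for the yarn CLI (the exact header wording and IDs are illustrative):

```shell
# `yarn application -list` prints two header lines (a summary line and the
# column header) before the rows, so NR > 2 keeps only real application
# IDs. A mock function stands in for the yarn CLI here.
mock_yarn_list() {
  printf 'Total number of applications:2\n'
  printf 'Application-Id\tApplication-Name\tApplication-Type\n'
  printf 'application_1623774792907_000001\tjob-a\tSPARK\n'
  printf 'application_1623774792907_000002\tjob-b\tSPARK\n'
}
# Same extraction as the logged loop; each printed ID would be fed to
# `yarn application -kill`.
mock_yarn_list | awk 'NR > 2 { print $1 }'
```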
2021-07-17
- 08:54 elukey: run 'sudo find -type f -name '*.log*' -mtime +30 -delete' on an-coord1001:/var/log/hive to free space (root partition almost filled up) - T279304
2021-07-15
- 16:44 ottomata: deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232
- 13:39 joal: Kill refine_event application_1623774792907_154469 to let manual run finish
- 13:35 joal: Kill currently running refine job (application_1623774792907_154014)
- 11:20 joal: Kill stuck refine application
2021-07-14
- 17:39 razzi: sudo cookbook sre.druid.roll-restart-workers public for https://phabricator.wikimedia.org/T283067
- 00:34 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart zookeeper
- 00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-coordinator
- 00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-broker
- 00:28 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-middlemanager
- 00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord
- 00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-historical
2021-07-13
- 19:29 joal: move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14
- 19:02 razzi: razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-workers analytics
- 13:03 joal: remove /wmf/gobblin/locks/event_default.lock to unlock gobblin event job
2021-07-12
- 18:37 joal: Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event
- 18:36 joal: Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events
- 18:17 ottomata: stopped puppet and refines and imports for event data on an-launcher1002 in prep for gobblin finalization for event_default job
- 12:31 joal: Rerun failed webrequest hour after having checked that loss was entirely false-positive
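The 18:37 entry swaps the live event path in two renames: current data is parked under a `*_camus` name, then the gobblin output is promoted to the live path. A local-filesystem sketch of the same two-step swap (on the cluster these are `hdfs dfs -mv` operations; plain `mv` on a scratch tree stands in here):

```shell
# Two-step directory swap as in the camus -> gobblin cutover: park the old
# tree under a *_camus name, then promote the new tree to the live path.
root=$(mktemp -d)
mkdir "$root/event" "$root/event_gobblin"
mv "$root/event" "$root/event_camus"       # step 1: park old data
mv "$root/event_gobblin" "$root/event"     # step 2: promote new data
ls "$root"   # event and event_camus remain
rm -r "$root"
```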
2021-07-09
- 03:21 joal: Rerun webrequest descendant jobs for 2021-07-08T10:00 problem
2021-07-08
- 17:22 joal: Deploy refinery to HDFS
- 16:57 joal: Kill-restart webrequest oozie job after gobblin time-format change
- 16:44 joal: Deploying refinery to an-launcher and hadoop-test
- 16:05 joal: Manually add /wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9/_IMPORTED
2021-07-07
- 17:03 joal: Deploy refinery to HDFS
- 16:52 joal: Deploy refinery to an-launcher1002
- 16:05 joal: Deploy refinery to test-cluster
- 13:30 joal: kill-restart webrequest using gobblin data
- 13:12 ottomata: deploying refinery to an-launcher1002 for webrequest gobblin migration
- 13:09 joal: Move data for webrequest camus-gobblin migration
- 13:03 ottomata: disabled camus-webrequest and gobblin-webrequest timer on an-launcher1002 in prep for migration
2021-07-06
- 17:33 joal: Deploy refinery onto HDFS
- 16:41 joal: Deploy refinery for gobblin
- 16:03 joal: Kill webrequest_test oozie job
- 15:55 joal: Drop and recreate wmf_raw.webrequest table on analytics-test-hadoop
- 15:52 joal: Moved camus and gobblin data for webrequest on analytics-test-hadoop
- 15:48 ottomata: deploying refinery to test cluster for webrequest_test gobblin job
- 14:16 ottomata: restarted aqs for july mw history snapshot deploy
- 13:29 joal: Run first manual empty job for webrequest_test on analytics-test-hadoop
- 13:29 joal: Clean gobblin state_store and data before starting webrequest_test on analytics-test-hadoop
2021-07-03
- 19:57 joal: rerun learning-features-actor-hourly-wf-2021-7-2-11
2021-07-02
- 13:47 joal: Reset failed timer refinery-sqoop-mediawiki-private.service
- 12:21 joal: Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes
- 00:04 razzi: razzi@an-coord1002:~$ sudo mount -a
- 00:04 razzi: razzi@an-coord1002:~$ sudo umount /mnt/hdfs
- 00:03 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service
- 00:02 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service
2021-07-01
- 18:56 razzi: razzi@authdns1001:~$ sudo authdns-update
- 18:19 razzi: razzi@an-coord1001:~$ sudo mount -a
- 18:18 razzi: razzi@an-coord1001:~$ sudo umount /mnt/hdfs
- 18:17 razzi: razzi@an-coord1001:~$ sudo systemctl restart presto-server.service
- 18:16 razzi: razzi@an-coord1001:~$ sudo systemctl restart hive-metastore.service
- 18:16 razzi: sudo systemctl restart hive-server2.service
- 18:15 razzi: sudo systemctl restart oozie on an-coord1001 for https://phabricator.wikimedia.org/T283067
- 16:38 razzi: sudo authdns-update on ns0.wikimedia.org to apply https://gerrit.wikimedia.org/r/c/operations/dns/+/702689
2021-06-30
[edit]- 18:19 razzi: unmount and remount /mnt/hdfs on an-test-client1001 for java security update
2021-06-29
[edit]- 22:55 razzi: sudo systemctl restart hive-server2 on an-test-coord1001.eqiad.wmnet for T283067
- 22:53 razzi: sudo systemctl restart hive-metastore on an-test-coord1001.eqiad.wmnet for T283067
- 22:52 razzi: sudo systemctl restart presto-server on an-test-coord1001.eqiad.wmnet for T283067
- 22:51 razzi: sudo systemctl restart oozie on an-test-coord1001.eqiad.wmnet for T283067
- 13:31 ottomata: deploying refinery for weekly train
2021-06-28
[edit]- 17:00 elukey: apt-get reinstall llvm-gpu on stat100[5-8] - T285495
2021-06-25
[edit]- 08:01 elukey: reboot an-worker1101 to unblock stuck GPU
- 07:57 elukey: execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as attempt to unblock the GPU
2021-06-24
[edit]- 06:38 elukey: drop hieradata/role/common/analytics_cluster/superset.yaml from puppet private repo (unused config, all the values duplicated in the new hiera config)
- 06:34 elukey: rename superset hiera role configs in puppet private repo (to match the role change done recently) + superset restart
2021-06-23
[edit]- 17:56 ottomata: enable canary events for NavigationTiming extension streams - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/699789
- 15:30 elukey: drop /reportupdater-queries on an-launcher1002 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/701130
2021-06-22
[edit]- 14:46 XioNoX: remove decom hosts from the analytics firewall filter on cr2-eqiad - T279429
- 14:37 XioNoX: start updating analytics firewall rules to capirca generated ones on cr2-eqiad - T279429
- 14:28 XioNoX: remove decom hosts from the analytics firewall filter on cr1-eqiad - T279429
- 14:12 XioNoX: start updating analytics firewall rules to capirca generated ones on cr1-eqiad - T279429
2021-06-21
[edit]- 13:35 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
2021-06-18
[edit]- 06:37 elukey: execute "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001 to free space in the root partition
2021-06-15
[edit]- 17:46 razzi: remove hdfs namenode backup on stat1004
- 17:45 razzi: enable puppet on an-launcher
- 17:45 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 16:55 razzi: sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
- 16:53 razzi: run uid script on an-master1002
- 16:33 elukey: restart hadoop-yarn-resourcemanager on an-master1001
- 16:16 razzi: sudo systemctl stop 'hadoop-*' on an-master1002
- 16:14 razzi: sudo systemctl stop hadoop-* on an-master1001, then realize I meant to do this on an-master1002, so start hadoop-*
- 16:11 razzi: downtime an-master1002
- 15:55 razzi: sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
- 15:42 razzi: tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current on an-master1001
- 15:38 razzi: backup /srv/hadoop/name/current to /home/razzi/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz on an-master1001
- 15:33 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
- 15:27 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
- 15:25 razzi: kill running yarn applications via for loop
- 15:11 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 15:09 razzi: disable puppet on an-masters
- 15:08 razzi: run puppet on an-masters to update capacity-scheduler.xml
- 15:02 razzi: disable puppet on an-masters
- 15:01 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to stop queues
- 14:35 razzi: disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641
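The 2021-06-15 master-reimage prep above follows a fixed order: stop the scheduled jobs, stop the YARN queues, then checkpoint HDFS before touching a namenode. A print-only sketch of that order is below; the timer glob and the idea that `-refreshQueues` picks up a queue-stopping `capacity-scheduler.xml` are assumptions layered on the logged commands.

```shell
#!/bin/bash
# Hedged, print-only sketch of the reimage prep order logged above.
set -euo pipefail
echo_do() { echo "+ $*"; }   # stand-in that prints instead of executing

# 1. Stop the timers that launch Hadoop jobs (on an-launcher1002).
#    'refine_*.timer' is an illustrative glob, not the real unit list.
echo_do sudo systemctl stop 'refine_*.timer'
# 2. Stop YARN queues: puppet ships a capacity-scheduler.xml with the
#    queues stopped, then the refresh applies it.
echo_do sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
# 3. Checkpoint the namespace with HDFS in safe mode before the reimage.
echo_do sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
echo_do sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
```

Doing the checkpoint last, under safe mode, guarantees the fsimage backed up before the reimage reflects a quiesced namespace.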
2021-06-14
[edit]- 18:45 ottomata: remove packages from hadoop common nodes: sudo cumin 'R:Class = profile::analytics::cluster::packages::common' 'apt-get -y remove python3-pandas python3-pycountry python3-numpy python3-tz' - T275786
- 18:43 ottomata: remove packages from stat nodes: sudo cumin 'stat*' apt-get -y remove subversion mercurial tofrodos libwww-perl libcgi-pm-perl libjson-perl libtext-csv-xs-perl libproj-dev libboost-regex-dev libboost-system-dev libgoogle-glog-dev libboost-iostreams-dev libgdal-dev
- 07:18 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-6-11
2021-06-10
[edit]- 21:17 razzi: sudo systemctl restart monitor_refine_eventlogging_analytics
- 18:17 razzi: sudo systemctl restart hadoop-mapreduce-historyserver
- 17:24 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1002
- 17:24 razzi: sudo systemctl restart hadoop-hdfs-zkfc on an-master1002
- 17:12 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
- 16:25 razzi: rolling restart hadoop masters to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
- 14:07 ottomata: altered event.wmdebannerevent event.eventRate field to change type from BIGINT to DOUBLE - T282562
2021-06-08
[edit]- 16:56 elukey: move away from dbstore1004 in favor of dbstore1007 in analytics CNAME/SRV records (will affect analytics-mysql and sqoop)
- 13:42 ottomata: roll restart an-conf zookeepers - T283067
- 13:22 ottomata: roll restarting analytics presto-servers - T283067
- 06:08 elukey: restart yarn nodemanager on analytics1075 to clear the un-healthy state after some days of downtime (one-off issue but let's keep an eye on it)
2021-06-07
[edit]- 18:14 ottomata: rolling restart of kafka jumbo brokers - T283067
- 17:53 ottomata: rolling restart of kafka jumbo mirror makers - T283067
- 17:07 ottomata: remove packages from analytics cluster nodes: sudo apt-get -y remove r-cran-rmysql python3-matplotlib python3-sklearn python3-enchant python3-nltk gfortran liblapack-dev libopenblas-dev - T275786
- 16:50 ottomata: restarting mysqld analytics-meta replica on db1108 to apply config change - T272973
2021-06-04
[edit]- 17:42 razzi: sudo cookbook sre.aqs.roll-restart aqs to deploy new mediawiki history snapshot
2021-06-03
[edit]- 22:32 razzi: sudo manage_principals.py create jdl --email_address=jlinehan@wikimedia.org
- 22:32 razzi: sudo manage_principals.py create phuedx --email_address=phuedx@wikimedia.org
- 15:46 ottomata: add airflow_2.1.0-py3.7-1_amd64.deb to apt.wm.org
- 15:20 ottomata: created airflow_analytics database and user on an-coord1001 analytics-meta instance - T272973
2021-06-02
[edit]- 18:09 ottomata: remove .deb packages from stat boxes: python3-mysqldb python3-boto python3-ua-parser python3-netaddr python3-pymysql python3-protobuf python3-unidecode python3-oauth2client python3-oauthlib python3-requests-oauthlib python3-ua-parser - T275786
2021-05-31
[edit]- 06:56 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-29
2021-05-27
[edit]- 14:37 elukey: removed Luca's and Tobias' emails from analytics-alerts@
- 07:01 elukey: roll restart hdfs namenodes to pick up new GC/heap settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/695933
2021-05-26
[edit]- 19:14 ottomata: deploying refinery and refinery source 0.1.13
- 17:29 ottomata: killing and restarting oozie cassandra loader jobs coord_unique_devices_daily and coord_pageview_top_percountry_daily after revert of oozie job to load to cassandra 3
- 14:18 ottomata: deploying refinery...
- 14:17 ottomata: Deployed refinery-source using jenkins
2021-05-25
[edit]- 18:16 razzi: sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002
- 18:14 razzi: sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service
- 18:01 razzi: manually edit /etc/hadoop/conf/capacity-scheduler.xml to make queues running and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 17:52 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002
- 17:28 razzi: sudo systemctl restart refine_eventlogging_legacy
- 17:28 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again
- 17:08 razzi: re-enabled puppet on an-masters and an-launcher
- 17:04 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
- 17:03 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
- 16:43 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1001
- 16:38 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
- 16:35 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
- 16:28 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
- 16:23 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
- 16:06 razzi: sudo systemctl restart hadoop-hdfs-namenode
- 15:52 razzi: checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
- 15:51 razzi: enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
- 15:36 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again
- 15:35 razzi: re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
- 15:32 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet
- 14:39 razzi: stop puppet on an-launcher and stop hadoop-related timers
- 01:09 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
- 01:07 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
- 00:34 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
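The repeated `haadmin -failover` entries above flip the active namenode between an-master1001 and an-master1002. A hedged helper like the one below makes the direction explicit by checking the current state first; here the state is passed in as a parameter for illustration, where in practice it would come from `hdfs haadmin -getServiceState <serviceId>`.

```shell
#!/bin/bash
# Illustrative helper (assumed shape): only print the failover command when
# the source namenode is actually active. Service IDs like
# an-master1002-eqiad-wmnet come from dfs.ha.namenodes.* in hdfs-site.xml.
set -euo pipefail

failover() {
  local from=$1 to=$2 state=$3  # state: output of haadmin -getServiceState
  if [ "$state" = "active" ]; then
    echo "sudo -u hdfs /usr/bin/hdfs haadmin -failover $from $to"
  else
    echo "$from is already $state; skipping failover"
  fi
}

failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet active
```

Guarding on the current state avoids the "failover to the node that is already active" round-trip visible in the 00:34/01:07/01:09 entries.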
2021-05-24
[edit]- 18:05 ottomata: resume failing cassandra 3 oozie loading jobs, they are also loading to cassandra 2: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
- 18:04 ottomata: suspend failing cassandra 3 oozie loading jobs: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
- 15:19 ottomata: rm -rf /tmp/analytics/* on an-launcher1002 - T283126
2021-05-20
[edit]- 06:05 elukey: kill christinedk's jupyter process on stat1007 (offboarded user) to allow puppet to run
2021-05-19
[edit]- 16:31 razzi: restart turnilo for T279380
2021-05-18
[edit]- 20:22 razzi: restart oozie virtualpageview hourly, virtualpageview druid daily, virtualpageview druid monthly
- 18:57 razzi: deployed refinery via scap, then deployed to hdfs
- 18:46 ottomata: removing extraneous python-kafka and python-confluent-kafka deb packages from analytics cluster - T275786
- 12:40 joal: Add monitoring data in cassandra-3
- 06:50 joal: run manual unique-devices cassandra job for one day with debug logging
- 02:20 ottomata: manually running drop_event with --verbose flag
2021-05-17
[edit]- 11:09 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after host generating failures has been moved out of cluster
- 10:41 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after drop/create of keyspace
- 10:28 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing
- 09:45 joal: Rerun of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-15
2021-05-13
[edit]- 11:41 hnowlan: running truncate "local_group_default_T_pageviews_per_article_flat".data; on aqs1012
2021-05-12
[edit]- 15:17 ottomata: dropped event.mediawiki_job_* tables and data directories with mforns - T273789 & T281605
- 13:56 ottomata: removing refine_mediawiki_job Refine jobs - T281605
2021-05-11
[edit]- 21:00 mforns: finished repeated refinery deployment (matching source v0.1.11) - missed unmerged change
- 19:59 mforns: repeating refinery deployment (matching source v0.1.11) - missed unmerged change
- 19:53 mforns: finished refinery deployment (matching source v0.1.11)
- 18:41 mforns: starting refinery deployment (matching source v0.1.11)
- 17:26 mforns: deployed refinery-source v0.1.11
2021-05-06
[edit]- 21:27 razzi: sudo manage_principals.py reset-password nahidunlimited --email_address=nsultan@wikimedia.org
- 13:29 elukey: roll restart of hadoop yarn nodemanagers to pick up TasksMax=26214
- 12:39 elukey: restart Yarn RMs to apply the dominant resource calculator setting - T281792
- 12:15 hnowlan: changed eventlogging CNAME to point to eventlog1003
- 09:19 hnowlan: starting decommission of eventlog1002
2021-05-05
[edit]- 17:36 razzi: create principal for sihe: sudo manage_principals.py create sihe --email_address=silvan.heintze@wikimedia.de
- 12:22 joal: Reset monitor_refine_eventlogging_legacy after manual rerun of failed job
- 12:02 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-5-4
2021-05-04
[edit]- 20:31 joal: Kill-restart 16 cassandra jobs
- 20:29 joal: Kill-restart referer-daily job
- 20:12 joal: Deploy refinery onto HDFS
- 19:46 joal: Deploying refinery using scap
- 19:34 joal: refinery v0.1.10 released to Archiva
2021-05-03
[edit]- 14:23 ottomata: stopping all venv based jupyter singleuser servers - T262847
- 13:59 ottomata: dropped all obsolete (upper-cased location) event_sanitized.*_T280813 tables created for T280813
- 10:43 joal: Add _SUCCESS flag to /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2021-04 after having manually sqooped missing tables
- 09:57 joal: restart refinery-sqoop-mediawiki-private timer after patch
- 09:56 joal: Reset refinery-sqoop-mediawiki-private timer
- 09:38 joal: Drop already sqooped data to restart jobs
- 08:53 joal: Deploy refinery for sqoop hotfix
- 08:33 elukey: clean up libmariadb-java from hadoop workers and clients
- 07:46 joal: Kill prod sqoop job to restart after fix
2021-04-30
[edit]- 07:04 elukey: hue restarted using the database 'hue' instead of 'hue_next'
- 06:56 elukey: stop hue to allow database rename (hue_next -> hue)
2021-04-29
[edit]- 15:55 razzi: restart hadoop-yarn-nodemanager and hadoop-hdfs-datanode on an-worker1100 for hadoop to recognize new disk /dev/sdl
- 15:38 ottomata: enabling event_sanitized_main jobs - T273789
- 14:57 elukey: run mysql_upgrade on an-coord1001 to complete the buster upgrade - T278424
- 14:44 hnowlan: restored all eventlogging jobs to eventlog1003
- 14:21 hnowlan: bump eventlog1003 CPUs to 6
- 13:53 joal: Rerun failed pageview-hourly-wf-2021-4-29-11 and pageview-hourly-wf-2021-4-29-12
- 13:09 joal: Rerun failed pageview-hourly-wf-2021-4-29-11
- 12:35 hnowlan: restarting 2 processors on eventlog1002
- 12:02 hnowlan: stopping processors on eventlog1002 to migrate to eventlog1003
- 11:50 elukey: manual stop of one of the eventlog processors on eventlog1002 to see if 1003 takes it over
- 02:59 milimetric: deployed hotfix for referrer job
2021-04-28
[edit]- 17:46 hnowlan: eventlog1003 joined to groups successfully
- 17:36 razzi: sudo mkdir /srv/log/eventlogging and sudo chown eventlogging:eventlogging /srv/log/eventlogging to workaround missing directory puppet error (to be puppetized later)
- 17:31 razzi: remove deployment cache on eventlogging1003: sudo rm -fr /srv/deployment/eventlogging/analytics-cache/
- 17:26 razzi: manually change /srv/deployment/eventlogging/analytics/.git/DEPLOY_HEAD to deployment1002 on deployment1002 to fix puppet scap error
- 16:53 hnowlan: stopping deployment-eventlog05 in deployment-prep
- 14:42 milimetric: deployed refinery with 0.1.9 jars and synced to hdfs
- 14:30 elukey: chown -R analytics-deploy:analytics-deploy /srv/deployment/analytics on an-coord1001
- 12:50 ottomata: applied data_purge jobs in analytics test cluster; old data will now be dropped there - T273789
2021-04-27
[edit]- 08:33 elukey: run mysql_upgrade for analytics-meta on an-coord1002 (should be part of the upgrade process) - T278424
- 07:11 elukey: restart yarn resource managers to pick up yarn label settings
2021-04-26
[edit]- 08:01 elukey: restart hadoop-mapreduce-historyserver on an-master1001 after changes to the yarn ui user
- 07:36 elukey: re-enable timers after setting the capacity scheduler
- 07:31 elukey: restart hadoop RM on an-master* to pick up capacity scheduler changes
- 06:44 elukey: stop timers on an-launcher1002 again as prep step for capacity scheduler changes
- 06:32 elukey: roll restart of hadoop-yarn-nodemanagers to pick up new log4j settings - T276906
- 06:25 elukey: re-enable timers
- 06:20 elukey: reboot an-coord1001 to pick up kernel security settings
- 05:57 elukey: stop timers on an-launcher1002 to allow a reboot of an-coord1001
2021-04-24
[edit]- 08:03 joal: Rerun failed webrequest-druid-hourly-wf-2021-4-23-13
2021-04-23
[edit]- 14:23 elukey: roll restart an-master100[1,2] daemons to pick up new log4j settings - T276906
- 10:30 elukey: restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906
- 09:12 elukey: change default log4j hadoop config to include rolling gzip appender
2021-04-21
[edit]- 21:30 ottomata: temporarily disabling sanitize_eventlogging_analytics_delayed jobs until T280813 is completed (probably tomorrow)
- 20:04 ottomata: renaming event_sanitized hive table directories to lower case and repairing table partition paths - T280813
- 09:28 elukey: roll restart druid-overlord on druid* after an-coord1001 maintenance
- 09:09 elukey: upgrade hue on an-tool1009 to 4.9.0-2
- 08:31 elukey: re-enable timers on an-launcher1002 and airflow on an-airflow1001 after maintenance on an-coord1001
- 07:08 elukey: reimage an-coord1001 after partition reshape (/var/lib/mysql folded in /srv)
- 06:51 elukey: stop airflow on an-airflow1001
- 06:49 elukey: stop all services on an-coord1001 as prep step for reimage
- 06:45 elukey: PURGE BINARY LOGS BEFORE '2021-04-14 00:00:00'; on an-coord1001 to free some space before the reimage
- 06:00 elukey: stop timers on an-launcher1002 as prep step for an-coord1001 reimage
2021-04-20
[edit]- 15:51 elukey: move analytics-hive.eqiad.wmnet back to an-coord1001 (test on an-coord1002 successful)
- 15:38 ottomata: deployed refinery to hdfs
- 13:59 ottomata: deploying refinery and refinery source 0.1.6 for weekly train
- 13:37 ottomata: deployed aqs
- 13:16 elukey: failover analytics-hive to an-coord1002 to test the host (running on buster)
- 12:40 elukey: PURGE BINARY LOGS BEFORE '2021-04-12 00:00:00'; on an-coord1001 - T280367
2021-04-19
[edit]- 16:45 ottomata: make RefineMonitor use analytics keytab - this should be a no-op
- 16:07 razzi: run kafka preferred-replica-election on jumbo cluster (kafka-jumbo1002)
- 06:50 elukey: move /var/lib/hadoop/name partition under /srv/hadoop/name on an-master1001 - T265126
- 05:45 elukey: cleanup Lex's jupyter notebooks on stat1007 to allow puppet to clean up
2021-04-18
[edit]- 07:25 elukey: run "PURGE BINARY LOGS BEFORE '2021-04-11 00:00:00';" on an-coord1001 to free some space - T280367
2021-04-16
[edit]- 15:14 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; on an-coord1001 to free space for /var/lib/mysql - T280367
- 15:13 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00';
- 07:54 elukey: drop all the cloudera packages from our repositories
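The recurring `PURGE BINARY LOGS BEFORE …` entries (2021-04-14 through 2021-04-18) free space under `/var/lib/mysql` on an-coord1001 by dropping old binlogs. A sketch that builds the statement from a retention window is below; the 7-day default and the commented invocation are assumptions, and as with the real runs, any replica must already be past the purged position.

```shell
#!/bin/bash
# Hedged sketch: build the PURGE BINARY LOGS statement used in the log
# entries above from a retention window, rather than a hand-typed date.
set -euo pipefail
DAYS=${DAYS:-7}   # retention window; 7 is an illustrative default
CUTOFF=$(date -d "-${DAYS} days" '+%Y-%m-%d 00:00:00')
SQL="PURGE BINARY LOGS BEFORE '${CUTOFF}';"
echo "$SQL"
# To actually run it (requires db root on the host), something like:
#   sudo mysql -e "$SQL"
```

Purging by timestamp rather than by log-file name keeps the operation idempotent: rerunning it with the same window is a no-op.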
2021-04-15
[edit]- 21:13 razzi: rebalance kafka partitions for webrequest_text partition 23
- 14:56 elukey: deploy refinery via scap - weekly train
- 09:50 elukey: rollback hue on an-tool1009 to 4.8, it seems that 4.9 still has issues
- 06:32 elukey: move hue.wikimedia.org to an-tool1009 (from analytics-tool1001)
- 01:36 razzi: rebalance kafka partitions for webrequest_text partitions 21,22
2021-04-14
[edit]- 14:05 elukey: run build/env/bin/hue migrate on an-tool1009 after the hue upgrade
- 13:10 elukey: rollback hue-next to 4.8 - issues not present in staging
- 13:00 elukey: upgrade Hue to 4.9 on an-tool1009 - hue-next.wikimedia.org
- 10:02 elukey: roll restart yarn nodemanagers on hadoop prod (attempt to see if they entered in a weird state, graceful restart)
- 09:54 elukey: kill long running mediawiki-job refine erroring out application_1615988861843_166906
- 09:46 elukey: kill application_1615988861843_163186 for the same reason
- 09:43 elukey: kill application_1615988861843_164387 to see if any improvement to socket consumption is made
- 09:14 elukey: run "sudo kill `pgrep -f sqoop`" on an-launcher1002 to clean up old test processes still running
2021-04-13
[edit]- 16:17 razzi: rebalance kafka partitions for webrequest_text partitions 19, 20
- 13:18 ottomata: Refine now uses refinery-job 0.1.4; RefineFailuresChecker has been removed and its function rolled into RefineMonitor -
- 10:23 hnowlan: deploying aqs with updated cassandra libraries to aqs1004 while depooled
- 06:17 elukey: kill application application_1615988861843_158645 to free space on analytics1070
- 06:10 elukey: kill application_1615988861843_158592 on analytics1061 to allow space to recover (truncate of course in D state)
- 06:05 elukey: truncate logs for application_1615988861843_158592 on analytics1061 - one partition full
2021-04-12
[edit]- 14:21 ottomata: stop using http proxies for produce_canary_events_job - T274951
2021-04-08
[edit]- 16:33 elukey: reboot an-worker1100 again to check if all the disks come up correctly
- 15:43 razzi: rebalance kafka partitions for webrequest_text partitions 17, 18
- 15:35 elukey: reboot an-worker1100 to see if it helps with the strange BBU behavior in T279475
- 14:07 elukey: drop /var/spool/rsyslog from stat1008 - corrupted files due to root partition filled up caused a SEGV for rsyslog
- 11:14 hnowlan: created aqs user and loaded full schemas into analytics wmcs cassandra
- 08:35 elukey: apt-get clean on stat1008 to free some space
- 07:44 elukey: restart hadoop hdfs masters on an-master100[1,2] to apply the new log4j settings for the audit log
- 06:44 elukey: re-deployed refinery to hadoop-test after fixing permissions on an-test-coord1001
2021-04-07
[edit]- 23:03 ottomata: installing anaconda-wmf-2020.02~wmf5 on remaining nodes - T279480
- 22:51 ottomata: installing anaconda-wmf-2020.02~wmf5 on stat boxes - T279480
- 22:47 mforns: finished refinery deployment up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
- 22:39 mforns: deployment of refinery via scap to hadoop-test failed with Permission denied: '/srv/deployment/analytics/refinery-cache/.config' (deployment to production went fine)
- 21:44 mforns: starting refinery deploy up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
- 21:26 mforns: deployed refinery-source v0.1.4
- 21:25 razzi: sudo apt-get install --reinstall anaconda-wmf on stat1008
- 20:15 razzi: rebalance kafka partitions for webrequest_text partitions 15, 16
- 19:53 ottomata: upgrade anaconda-wmf everywhere to 2020.02~wmf4 with fixes for T279480
- 14:03 hnowlan: setting profile::aqs::git_deploy: true in aqs-test1001 hiera config
2021-04-06
[edit]- 22:34 razzi: rebalance kafka partitions for webrequest_text_13,14
- 09:37 elukey: reimage an-coord1002 to Debian Buster
2021-04-05
[edit]- 16:07 razzi: remove old hive logs on an-coord1001: sudo rm /var/log/hive/hive-*.log.2021-02-*
- 14:54 razzi: remove empty /var/log/sqoop on an-launcher1002 (logs go in /var/log/refinery); sudo rmdir /var/log/sqoop
- 14:51 razzi: rebalance kafka partitions for webrequest_text partitions 11, 12
2021-04-02
[edit]- 16:28 razzi: rebalance kafka partitions for webrequest_text partitions 9,10
- 16:19 elukey: all the Hadoop test cluster on Debian Buster
- 07:28 elukey: manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b
2021-04-01
[edit]- 20:27 razzi: restore superset_production from backup superset_production_1617306805.sql
- 20:14 razzi: manually run bash /srv/deployment/analytics/superset/deploy/create_virtualenv.sh as analytics_deploy on an-tool1010, since somehow it didn't run with scap
- 20:01 razzi: sudo chown -R analytics_deploy:analytics_deploy /srv/deployment/analytics/superset/venv since it's owned by root and needs to be removed upon deployment
- 19:54 razzi: dump superset production to an-coord1001.eqiad.wmnet:/home/razzi/superset_production_1617306805.sql just in case
- 16:50 razzi: rebalance kafka partitions for webrequest_text partitions 7 and 8
2021-03-31
[edit]- 14:18 hnowlan: starting copy of large tables from aqs1007 to aqs1011
2021-03-30
[edit]- 20:25 joal: Kill-Restart data_quality_stats-hourly-bundle after deploy
- 20:19 joal: Deploying refinery onto HDFS
- 19:57 joal: Deploying refinery using scap
- 19:57 joal: Refinery-source released to archiva and new jars committed to refinery (v0.1.3)
- 17:07 razzi: rebalance kafka partitions for webrequest_text partitions 5 and 6
- 12:35 hnowlan: Depooling aqs1004 for another transfer of local_group_default_T_pageviews_per_article_flat
- 12:30 elukey: restart reportupdater-codemirror on an-launcher1002 for T275757
- 11:30 elukey: ERRATA: upgrade to 2.3.6-2
- 11:29 elukey: upgrade hive client packages to 2.3.6-1 on an-launcher1002 (already applied to all stat100x)
2021-03-25
[edit]- 15:58 elukey: disable vmemory checks in Yarn nodemanagers on Hadoop
- 13:53 elukey: systemctl restart performance-asotranking on stat1007 for T276121
- 08:14 elukey: upgrade hive packages on stat100x to 2.3.6-2 - T276121
- 08:12 elukey: upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia
2021-03-24
[edit]- 18:49 elukey: systemctl restart refinery-import-* failed jobs (/mnt/hdfs errors due to me umounting the mountpoint)
- 18:43 elukey: kill fuse hdfs mount process on an-launcher1002, re-mounted /mnt/hdfs, too many processes in D state
- 15:46 razzi: rebalance kafka partitions for webrequest_text partitions 3 and 4
- 05:40 razzi: sudo chown analytics /var/log/refinery/sqoop-mediawiki.log.1 on an-launcher1002 and restart logrotate
2021-03-22
[edit]- 18:12 elukey: drop /srv/.hardsync* to clean up hardlinks not needed
- 18:07 elukey: run rm -rfv .hardsync.*/archive/public-datasets/* on thorium:/srv to clean up files to drop (didn't work)
- 18:01 elukey: drop /srv/.hardsync*trash* on thorium - old hardlinks that should have been trashed
- 15:52 razzi: rebalance kafka partitions for webrequest_text partition 2
- 09:28 elukey: move the yarn scheduler in hadoop test to capacity
2021-03-19
[edit]- 15:44 razzi: rebalance kafka partitions for webrequest_text partition 1
2021-03-18
[edit]- 19:30 razzi: rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so back
- 19:29 razzi: temporarily rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so on aqs1004 to fix https://issues.apache.org/jira/browse/CASSANDRA-11574
- 19:02 ottomata: hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus - T275396
- 16:47 razzi: rebalance kafka partitions for webrequest_text partition 0
- 06:32 elukey: force a manual run of create_virtualenv.sh on an-tool1010 - superset down
2021-03-17
[edit]- 20:45 razzi: release wikistats 2.9.0
- 20:15 ottomata: install anaconda-wmf 2020.02~wmf3 on analytics cluster clients and workers - T262847
- 18:10 ottomata: started oozie/cassandra/coord_pageview_top_percountry_daily
- 15:21 razzi: rebalance kafka partitions for webrequest_upload partitions 22 and 23
- 13:54 razzi: sudo cookbook sre.hosts.reboot-single an-conf1001.eqiad.wmnet
- 13:47 razzi: sudo cookbook sre.hosts.reboot-single an-conf1003.eqiad.wmnet
- 13:41 razzi: sudo cookbook sre.hosts.reboot-single an-conf1002.eqiad.wmnet
- 13:39 ottomata: deploying refinery for weekly train
- 13:28 ottomata: deploy aqs as part of train - T207171, T263697
- 01:28 razzi: rebalance kafka partitions for webrequest_upload partition 21
2021-03-16
[edit]- 14:43 razzi: rebalance kafka partitions for webrequest_upload partition 20
- 03:17 razzi: rebalance kafka partitions for webrequest_upload partition 19
2021-03-15
[edit]- 16:53 razzi: rebalance kafka partitions for webrequest_upload partition 18
- 08:25 elukey: stop/start hdfs-balancer on an-launcher1002 with bw 200MB
- 07:48 joal: Manually start mediawiki-history-drop-snapshot.service to check the run succeeds
- 07:47 joal: Drop hive wmf.mediawiki_wikitext_history snapshot partitions (2020-08, 2020-09, 2020-10, 2020-11)
2021-03-14
[edit]- 20:49 joal: Manually clean some data (mediawiki-history-drop-snapshot.service seems not to be working)
- 20:46 joal: Force a run of mediawiki-history-drop-snapshot.service to clean up some data
2021-03-12
[edit]- 17:20 elukey: kill duplicate mediawiki-wikitext-history coordinator failing and sending emails to alerts@
- 07:21 elukey: re-run monitor_refine_event_failure_flags
2021-03-11
[edit]- 22:31 razzi: rebalance kafka partitions for webrequest_upload partition 17
- 20:20 razzi: disable maintenance mode for matomo1002
- 20:08 razzi: starting reboot of matomo1002 for kernel upgrade
- 18:52 razzi: systemctl restart hadoop-hdfs-datanode on analytics1059
- 18:50 razzi: systemctl restart hadoop-yarn-nodemanager on analytics1059
- 18:35 razzi: apt-get install parted on analytics1059
- 15:34 razzi: rebalance kafka partitions for webrequest_upload partition 17
- 10:52 elukey: drop /home/bsitzmann on all stat100x hosts - T273712
- 08:25 elukey: drop database dedcode cascade in hive - T276748
- 08:15 elukey: hdfs dfs -rmr /user/dedcode on an-launcher1002 (data in trash for a month) - T276748
2021-03-10
[edit]- 23:15 razzi: rebalance kafka partitions for webrequest_upload partition 16
- 18:44 mforns: finished deployment of refinery (session length oozie job)
- 18:16 mforns: starting deployment of refinery (session length oozie job)
- 16:54 razzi: rebalance kafka partitions for webrequest_upload partition 15
- 07:05 elukey: all hadoop worker nodes on Buster
- 06:28 elukey: force the re-run of refine_eventlogging_legacy - failed due to worker reimage in progress
- 06:17 elukey: reimage an-worker1111 to buster
2021-03-09
[edit]- 22:00 razzi: rebalance kafka partitions for webrequest_upload partition 14
- 20:42 elukey: reimaged an-worker1091 to buster
- 18:26 elukey: reimage an-worker1087 to buster
- 16:40 elukey: reimage analytics1077 to buster
- 15:36 razzi: rebalance kafka partitions for webrequest_upload partition 13
- 15:18 elukey: reimage analytics1072 (hadoop hdfs journal node) to buster
- 14:29 elukey: drain + reimage an-worker1090/89 to Buster
- 13:26 elukey: reimage an-worker1102 and an-worker1080 (hdfs journal node) to Buster
- 12:59 elukey: drain + reimage an-worker1103 to Buster
- 09:14 elukey: drain + reimage analytics1076 and an-worker1112 to Buster
- 07:01 elukey: drain + reimage an-worker109[4,5] to Buster
2021-03-08
[edit]- 23:22 razzi: rebalance kafka partitions for webrequest_upload partition 12
- 18:49 razzi: rebalance kafka partitions for webrequest_upload partition 11
- 18:11 elukey: drain + reimage an-worker11[15,16] to Buster
- 17:12 elukey: drain + reimage an-worker11[13,14] to Buster
- 16:17 elukey: drain + reimage an-worker1109/1110 to Buster
- 14:54 elukey: drain + reimage an-worker110[7,8] to Buster
- 14:52 ottomata: altered topics (eqiad|codfw).mediawiki.client.session_tick to have 2 partitions - T276502
- 13:51 elukey: drain + reimage an-worker110[4,5] to Buster
- 10:41 elukey: drain + reimage an-worker1104/1089 to Debian Buster
- 09:19 elukey: drain + reimage an-worker108[3,4] to Buster
- 08:20 elukey: drain + reimage an-worker108[1,2] to Buster
- 07:23 elukey: drain + reimage analytics107[4,5] to Buster
2021-03-07
[edit]- 08:00 elukey: "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066
- 07:49 elukey: umount /var/lib/hadoop/data/e on analytics1059 and restart hadoop daemons to exclude failed disk - T276696
2021-03-05
[edit]- 18:30 razzi: run again sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
- 18:18 razzi: sudo cookbook sre.dns.netbox -t T269211 "Move clouddb1021 to private vlan"
- 18:17 razzi: re-run interface_automation.ProvisionServerNetwork with private vlan
- 18:16 razzi: delete non-mgmt interface for clouddb1021
- 17:07 razzi: sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
- 16:54 razzi: sudo cookbook sre.dns.netbox -t T269211 "Reimage and rename labsdb1012 to clouddb1021"
- 16:52 razzi: run script at https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/
- 16:47 razzi: edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
- 16:30 razzi: delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
- 16:28 razzi: rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
- 16:08 razzi: sudo cookbook sre.hosts.decommission labsdb1012.eqiad.wmnet -t T269211
- 15:52 razzi: stop mariadb on labsdb1012
- 15:39 razzi: rebalance kafka partitions for webrequest_upload partition 10
- 15:07 elukey: drain + reimage analytics1073 and an-worker1086 to Debian Buster
- 13:36 elukey: roll restart HDFS Namenodes for the Hadoop cluster to pick up new Xmx settings (https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659)
- 10:20 elukey: force run of refinery-druid-drop-public-snapshots to check Druid public's performance
- 10:06 elukey: failover HDFS Namenode from 1002 to 1001 (high GC pauses triggered the HDFS zkfc daemon on 1001 and the failover to 1002)
- 08:32 elukey: drain + reimage an-worker107[8,9] to Debian Buster (one Journal node included)
- 07:22 elukey: drain + reimage analytics107[0-1] to debian buster
- 07:13 elukey: add analytics1066 back with /dev/sdb removed
- 07:01 elukey: stop hadoop daemons on analytics1066 - disk errors on /dev/sdb after reimage
2021-03-04
- 21:19 razzi: rebalance kafka partitions for webrequest_upload partition 9
- 16:27 elukey: drain + reimage analytics106[8,9] to Debian Buster (one is a journalnode)
- 15:12 elukey: drain + reimage analytics106[6,7] to Debian Buster
- 14:21 elukey: drain + reimage analytics1065 to Debian Buster
- 13:32 elukey: drain + reimage analytics10[63,64] to Debian Buster
- 12:48 elukey: drain + reimage analytics10[61,62] to Debian Buster
- 10:40 elukey: drain + reimage analytics1059/1060 to Debian Buster
- 09:32 elukey: reboot an-worker[1097-1101] (GPU workers) to pick up the new kernel (5.10)
- 09:02 elukey: kill/start mediawiki-geoeditors-monthly to apply backtick change (hive script)
- 08:48 elukey: deploy refinery to hdfs
- 08:34 elukey: deploy refinery to fix https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668111
- 07:38 elukey: reboot an-worker1096 to pick up 5.10 kernel
2021-03-03
- 17:10 elukey: update druid datasource on aqs (roll restart of aqs on aqs100*)
- 17:06 razzi: rebalance kafka partitions for webrequest_upload partition 8
- 14:20 elukey: reimage an-worker1099,1100,1101 (GPU worker nodes) to Debian Buster
- 10:16 elukey: add an-worker113[2,5-8] to the Analytics Hadoop cluster
2021-03-02
- 23:15 mforns: finished deployment of refinery to hdfs
- 21:59 mforns: starting refinery deployment using scap
- 21:48 mforns: deployed refinery-source v0.1.2
- 17:26 razzi: rebalance kafka partitions for webrequest_upload partition 7
- 13:42 elukey: Add an-worker11[19,20-28,30,31] to Analytics Hadoop
- 10:21 elukey: roll restart druid historicals on druid public to pick up new cache settings (enable segment caching)
- 10:14 elukey: roll restart druid brokers on druid public to pick up new cache settings (no segment caching, only query caching)
- 08:01 elukey: manual start of performance-asotranking on stat1007 (requested by Gilles) - T276121
2021-03-01
- 21:24 razzi: rebalance kafka partitions for webrequest_upload partition 6
- 18:14 razzi: restart timer that wasn't running on an-worker1101: sudo systemctl restart prometheus-debian-version-textfile.timer
- 17:40 elukey: reimage an-worker1098 (GPU worker node) to Buster
- 14:48 elukey: reimage an-worker1097 (gpu node) to debian buster
- 11:55 elukey: roll restart druid broker on druid-analytics (again) to enable query cache settings (missing config due to typo)
- 11:34 elukey: roll restart historical daemons (again) on druid-analytics to remove stale config and enable (finally) segment caching.
- 11:02 elukey: roll restart druid-broker and druid-historical daemons on druid-analytics to pick up new cache settings (disable segment caching on broker and enable it on historicals)
- 09:12 elukey: restart hadoop daemons on an-worker1112 to pick up the new disk
- 09:11 elukey: remount /dev/sdl on an-worker1112 (wasn't able to make it fail)
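The 11:02–11:55 entries in this section split the caching duties: query-result caching on the druid-analytics brokers, segment caching on the historicals. A hypothetical runtime.properties fragment in that shape; the key names are Druid's, but the values are illustrative and not taken from the actual puppet change:

```properties
# Broker side (hypothetical values): caffeine query cache enabled.
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=1073741824

# Historical side: the complementary segment-caching settings.
# druid.historical.cache.useCache=true
# druid.historical.cache.populateCache=true
```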
2021-02-26
- 16:03 razzi: rebalance kafka partitions for webrequest_upload partition 4
- 12:33 elukey: reimaged an-worker1096 (GPU node) to Debian buster (preserving datanode dirs)
- 09:52 elukey: reimaged analytics1058 to debian buster (preserving datanode partitions)
- 07:50 elukey: attempt to reimage analytics1058 (part of the cluster, not a new worker node) to Buster
- 07:29 elukey: added journalnode partition to all hadoop workers not having it in the Analytics cluster
- 07:01 elukey: reboot an-worker1099 to clear out kernel soft lockup errors
- 06:59 elukey: restart datanode on an-worker1099 - soft lockup kernel errors
2021-02-25
- 17:04 razzi: rebalance kafka partitions for webrequest_upload partition 3
- 13:36 elukey: drop /srv/backup/wikistats from thorium
- 13:35 elukey: drop /srv/backup/backup_wikistats_1 from thorium
- 11:14 elukey: add an-worker111[7,8] to Analytics Hadoop (were previously backup worker nodes)
- 08:50 elukey: move analytics-privatedata/search/product to fixed gid/uid on all buster nodes (including airflow/stat100x/launcher)
2021-02-24
- 19:16 ottomata: service hadoop-yarn-nodemanager start on an-worker1112
- 16:03 milimetric: deployed refinery
- 14:09 elukey: roll restart druid brokers on druid public to pick up caffeine cache settings
- 14:03 elukey: roll restart druid brokers on druid analytics to pick up caffeine cache settings
- 11:08 elukey: restart druid-broker on an-druid1001 (used by Turnilo) with caffeine cache
- 09:01 elukey: roll restart druid brokers on druid public - locked
- 07:47 elukey: change gid/uid for druid + roll restart of all druid nodes
2021-02-23
- 21:20 ottomata: started nodemanager on an-worker1112
- 21:15 razzi: rebalance kafka partitions for webrequest_upload partition 2
- 19:31 elukey: roll out new uid/gid for mapred/druid/analytics/yarn/hdfs for all buster nodes (no op for stretch)
- 17:47 elukey: change uid/gid for yarn/mapred/analytics/hdfs/druid on stat100x, an-presto100x
- 15:57 elukey: an-launcher1002's timers restored
- 15:28 elukey: stop timers on an-launcher1002 to change gid/uid for yarn/hdfs/mapred/analytics/druid and to reboot for kernel updates
- 15:23 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-tool100[8,9]
- 15:22 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-airflow1001, an-test* buster nodes
- 15:05 klausman: an-master1001 ~ $ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp analytics-privatedata-users /wmf/data/raw/webrequest/webrequest_text/hourly/2021/02/22/01/webrequest*
- 14:51 elukey: drop /srv/backup-1007 on stat1008 to free space
2021-02-22
- 19:27 ottomata: restart oozie on an-coord1001 to pick up new spark share lib without hadoop jars - T274384
- 14:38 ottomata: upgrade spark2 on analytics cluster to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) - T274384
- 14:12 ottomata: upgrade spark2 on an-coord1001 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed), will remove and auto-re add spark-2.4.4-assembly.zip in hdfs after running puppet here
- 14:07 ottomata: upgrade spark2 on stat1004 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed)
- 09:01 elukey: reboot stat1005/stat1008 for kernel upgrades
2021-02-19
- 15:53 elukey: restart oozie again to test another setting for role/admins
- 15:43 ottomata: installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384
- 15:31 elukey: restart oozie to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352
- 14:34 joal: rerun mobile_apps-uniques-daily-wf-2021-2-18
- 09:16 elukey: stop and decom the hadoop backup cluster
2021-02-18
- 18:38 razzi: rebalance kafka partitions for webrequest_upload partition 1
- 17:27 elukey: an-coord1002 back in service with raid1 configured
- 15:48 elukey: stop hive/mysql on an-coord1002 as precautionary step to rebuild the md array
- 13:10 elukey: failover analytics-hive to an-coord1001 after maintenance (DNS change)
- 11:32 elukey: restart hive daemons on an-coord1001 to pick up new parquet settings
- 10:07 elukey: hive failover to an-coord1002 to apply new hive settings to an-coord1001
- 10:00 elukey: restart hive daemons on an-coord1002 (standby coord) to pick up new default parquet file format change
- 09:46 elukey: upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x
2021-02-17
- 17:44 razzi: rebalance kafka partitions for webrequest_upload partition 0
- 16:14 razzi: rebalance kafka partitions for eqiad.mediawiki.api-request
- 07:04 elukey: reboot stat1004/stat1006/stat1007 for kernel upgrades
2021-02-16
- 22:31 razzi: rebalance kafka partitions for codfw.mediawiki.api-request
- 17:44 razzi: rebalance kafka partitions for netflow
- 17:42 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
- 07:32 elukey: restart hadoop daemons on an-worker1099 after reconfiguring a new disk
- 06:58 elukey: restart hdfs/yarn daemons on an-worker1097 to exclude a failed disk
2021-02-15
- 20:38 mforns: running hdfs fsck to troubleshoot corrupt blocks
- 17:28 elukey: restart hdfs namenodes on the main cluster to pick up new racking changes (worker nodes from the backup cluster)
2021-02-14
- 09:38 joal: Restart and backfill mediacount and mediarequest, and backfill mediarequest-AQS and mediacount archive
- 09:38 joal: deploy refinery onto hdfs
- 09:14 joal: Deploy hotfix for mediarequest and mediacount
2021-02-12
- 19:19 milimetric: deployed refinery with query syntax fix for the last broken cassandra job and an updated EL whitelist
- 18:34 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
- 18:31 razzi: rebalance kafka partitions for __consumer_offsets
- 17:48 joal: Rerun wikidata-articleplaceholder_metrics-wf-2021-2-10
- 17:47 joal: Rerun wikidata-specialentitydata_metrics-wf-2021-2-10
- 17:43 joal: Rerun wikidata-json_entity-weekly-wf-2021-02-01
- 17:08 elukey: reboot presto workers for kernel upgrade
- 16:32 mforns: finished deployment of analytics-refinery
- 15:26 mforns: started deployment of analytics-refinery
- 15:16 elukey: roll restart druid broker on druid-public to pick up new settings
- 07:54 elukey: roll restart of druid brokers on druid-public - locked after scheduled datasource deletion
- 07:47 elukey: force a manual run of refinery-druid-drop-public-snapshots on an-launcher1002 (3d before its natural start) - controlled execution to see how druid + 3xdataset replication reacts
2021-02-11
- 14:26 joal: Restart oozie API job after spark sharelib fix (start: 2021-02-10T18:00)
- 14:20 joal: Rerun failed clickstream instance 2021-01 after sharelib fix
- 14:16 joal: Restart oozie after having fixed the spark-2.4.4 sharelib
- 14:12 joal: Fix oozie sharelib for spark-2.4.4 by copying oozie-sharelib-spark-4.3.0.jar onto the spark folder
- 02:19 milimetric: deployed again to fix old spelling error :) referererererer
- 00:05 milimetric: deployed refinery and synced to hdfs, restarting cassandra jobs gently
2021-02-10
- 21:46 razzi: rebalance kafka partitions for eqiad.mediawiki.cirrussearch-request
- 21:10 razzi: rebalance kafka partitions for codfw.mediawiki.cirrussearch-request
- 19:11 elukey: drop /user/oozie/share + chmod -R o+rx /user/oozie/share + restart oozie
- 17:56 razzi: rebalance kafka partitions for eventlogging-client-side
- 01:07 milimetric: deployed refinery with some fixes after BigTop upgrade, will restart three coordinators right now
2021-02-09
- 22:04 razzi: rebalance kafka partitions for eqiad.resource-purge
- 20:51 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T07:00 after data was imported to camus
- 20:50 razzi: rebalance kafka partitions for codfw.resource-purge
- 20:31 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T06:00 after data was imported to camus
- 16:30 elukey: restart datanode on an-worker1100
- 16:14 ottomata: restart datanode on analytics1059 with 16g heap
- 16:08 ottomata: restart datanode on an-worker1080 with 16g heap
- 15:58 ottomata: restart datanode on analytics1058
- 15:55 ottomata: restart datanode on an-worker1115
- 15:38 elukey: restart namenode on an-master1002
- 15:01 elukey: restart an-worker1104 with 16g heap size to allow bootstrap
- 15:01 elukey: restart an-worker1103 with 16g heap size to allow bootstrap
- 14:57 elukey: restart an-worker1102 with 16g heap size to allow bootstrap
- 14:54 elukey: restart an-worker1090 with 16g heap size to allow bootstrap
- 14:50 elukey: restart analytics1072 with 16g heap size to allow bootstrap
- 14:50 elukey: restart analytics1069 with 16g heap size to allow bootstrap
- 14:08 elukey: restart analytics1069's datanode with bigger heap size
- 13:39 elukey: restart hdfs-datanode on analytics10[65,69] - failed to bootstrap due to issues reading datanode dirs
- 13:38 elukey: restart hdfs-datanode on an-worker1080 (test canary - not showing up in block report)
- 10:04 elukey: stop mysql replication an-coord1001 -> an-coord1002, an-coord1001 -> db1108
- 08:29 elukey: leave hdfs safemode to let distcp do its job
- 08:25 elukey: set hdfs safemode on for the Analytics cluster
- 08:19 elukey: umount /mnt/hdfs from all nodes using it
- 08:16 joal: Kill flink yarn app
- 08:08 elukey: stop jupyterhub on stat100x
- 08:07 elukey: stop hive on an-coord100[1,2] - prep step for bigtop upgrade
- 08:05 elukey: stop oozie an-coord1001 - prep step for bigtop upgrade
- 08:03 elukey: stop presto-server on an-presto100x and an-coord1001 - prep step for bigtop upgrade
- 07:28 elukey: roll out new apt bigtop changes across all hadoop-related nodes
- 07:19 joal: Killing yarn users applications
- 07:12 elukey: stop airflow on an-airflow1001 (prep step for bigtop)
- 07:09 elukey: stop namenode on an-worker1124 (backup cluster), create two new partitions for backup and namenode, restart namenode
- 06:14 elukey: disable timers on labstore nodes (prep step for bigtop)
- 06:11 elukey: disable systemd timers on an-launcher1002 (prep step for bigtop)
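The 08:25/08:29 safemode entries in this section bracket the read-only window used for the BigTop upgrade prep: enter safemode, let distcp copy while nothing can write, then leave. A dry-run sketch of that sequence, using the upstream `hdfs dfsadmin` commands and the `kerberos-run-command` wrapper that appears elsewhere in this log (the `run`/`DRYRUN` indirection is added here purely so the sequence can be read and tested without a cluster):

```shell
# Dry-run sketch of the logged safemode dance; with DRYRUN=1 (the default)
# each command is only echoed, never executed.
run() { if [ "${DRYRUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter  # block writes
run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get    # expect "Safe mode is ON"
# ... distcp / upgrade work happens while HDFS is read-only ...
run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave  # resume writes
```

Note the log shows `leave` at 08:29 specifically so distcp could finish its job, i.e. safemode was held only as long as strictly needed.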
2021-02-08
- 22:29 elukey: the previous entry was related to the Hadoop backup cluster
- 22:29 elukey: hdfs master failover an-worker1118 -> an-worker1124, created dedicated partition for /var/lib/hadoop/name (root partition filled up), restarted namenode on 1118 (now recovering edit logs)
- 18:44 razzi: rebalance kafka partitions for eventlogging_VirtualPageView
- 15:12 ottomata: set kafka topic retention to 31 days for (eqiad|codfw).rdf-streaming-updater.mutation in kafka main-eqiad and main-codfw - T269619
2021-02-05
- 20:31 razzi: rebalance kafka partitions for eventlogging_SearchSatisfaction
- 19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.client.session_tick
- 18:38 razzi: rebalance kafka partitions for codfw.mediawiki.client.session_tick
- 17:53 razzi: rebalance kafka partitions for codfw.resource_change
- 17:53 razzi: rebalance kafka partitions for eqiad.resource_change
- 11:31 elukey: restart turnilo to pick up changes to the config (two new attributes to webrequest_128)
2021-02-04
- 19:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.wikibase-addUsagesForPage
- 19:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.wikibase-addUsagesForPage
- 19:22 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
- 17:04 elukey: restart presto coordinator on an-coord1001 to pick up logging settings (log to http-request.log)
- 17:02 elukey: roll restart presto on an-presto* to finally get http-request.log
- 11:28 elukey: move aqs druid snapshot config to 2021-01
- 09:01 elukey: restart superset and disable memcached caching
- 08:08 elukey: move an-worker1117 from Hadoop Analytics to Hadoop Backup
2021-02-03
- 21:38 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
- 20:04 razzi: rebalance kafka partitions for eqiad.mediawiki.job.RecordLintJob
- 20:03 razzi: rebalance kafka partitions for codfw.mediawiki.job.RecordLintJob
- 18:28 razzi: rebalance kafka partitions for eqiad.mediawiki.job.refreshLinks
- 18:28 razzi: rebalance kafka partitions for codfw.mediawiki.job.refreshLinks
- 17:52 razzi: rebalance kafka partitions for eqiad.wdqs-internal.sparql-query
- 17:50 razzi: rebalance kafka partitions for codfw.wdqs-internal.sparql-query
- 14:48 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/mediawiki/history_reduced
- 14:45 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf/mediawiki
- 14:40 elukey: kill + restart webrequest-druid-{hourly,daily} to pick up new changes after refinery deployment
- 14:30 elukey: kill + relaunch webrequest_load to pick up new changes after refinery deployment
- 14:28 elukey: relaunch edit-hourly-druid-coord 02-2021 after chmods
- 14:25 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/edit
- 14:24 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf
- 10:57 elukey: deploy refinery to hdfs
- 10:36 elukey: released Refinery Source 0.1.0
- 08:54 elukey: drop v0.1.x tags from Refinery source upstream repo
- 08:48 elukey: drop refinery source artifacts v0.1.2 from Archiva
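The 14:24/14:25 chmods in this section follow the POSIX traversal rule: a path is only reachable for "other" users if every ancestor directory grants them the execute bit, which is why the parent (`/wmf/data/wmf`) got a plain `o+rx` and the leaf tree a recursive one. A local sketch of the same rule on a throwaway mktemp tree (`hdfs dfs -chmod` uses the same mode semantics; the paths below just mimic the logged ones):

```shell
# Local demo of the traversal rule behind the logged chmods.
tmp=$(mktemp -d)
mkdir -p "$tmp/wmf/data/wmf/edit"
echo hourly > "$tmp/wmf/data/wmf/edit/part-0"

chmod 750 "$tmp/wmf/data/wmf"           # others: no read, no traversal
chmod o+rx "$tmp/wmf/data/wmf"          # the non-recursive fix on the parent
chmod -R o+rx "$tmp/wmf/data/wmf/edit"  # the recursive fix on the data tree

ls -ld "$tmp/wmf/data/wmf" | cut -c8-10  # the "other" permission triplet
```

The last line prints `r-x` once others can traverse the parent; without the `o+x` on `/wmf/data/wmf`, the recursive chmod on `edit/` alone would not have made the files reachable.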
2021-02-02
- 20:39 razzi: rebalance kafka partitions for eqiad.mediawiki.job.htmlCacheUpdate
- 20:39 razzi: rebalance kafka partitions for codfw.mediawiki.job.htmlCacheUpdate
- 19:29 ottomata: manually altered event.codemirrorusage to fix incompatible type change: https://phabricator.wikimedia.org/T269986#6797385
- 19:28 elukey: change archiva-ci password in pwstore, archiva and jenkins
- 17:53 razzi: rebalance kafka partitions for eqiad.wdqs-external.sparql-query
- 17:17 razzi: rebalance kafka partitions for eventlogging_CentralNoticeImpression
- 16:39 razzi: rebalance kafka partitions for eventlogging_InukaPageView
- 08:42 elukey: decommission an-worker1117 from the Hadoop cluster, to move it under the Backup cluster
2021-02-01
- 21:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.cdnPurge
- 21:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.cdnPurge
- 20:51 razzi: rebalance kafka partitions for eventlogging_PaintTiming
- 19:01 razzi: rebalance kafka partitions for eventlogging_LayoutShift
- 18:58 razzi: rebalance kafka partitions for eqiad.mediawiki.job.recentChangesUpdate
- 18:58 razzi: rebalance kafka partitions for codfw.mediawiki.job.recentChangesUpdate
- 18:23 razzi: rebalance kafka partitions for codfw.mediawiki.recentchange
- 18:09 razzi: rebalance kafka partitions for eqiad.resource_change
2021-01-29
- 20:23 razzi: rebalance kafka partitions for eventlogging_NavigationTiming
- 19:30 razzi: rebalance kafka partitions for eqiad.mediawiki.revision-score
- 19:29 razzi: rebalance kafka partitions for codfw.mediawiki.revision-score
- 19:14 razzi: rebalance kafka partitions for eventlogging_CpuBenchmark
- 19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
- 19:10 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
- 14:33 elukey: rollback presto upgrade, workers seem unable to announce themselves to the query coordinator
- 14:08 elukey: upgrade presto to 0.246 (from 0.226) on an-presto1001 - worker node
- 14:02 elukey: upgrade presto to 0.246 (from 0.226) on an-coord1001 - query coordinator
- 07:44 joal: Copy /wmf/data/event_sanitized to backup cluster (T272846)
2021-01-28
- 22:23 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
- 22:22 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
- 22:01 razzi: rebalance kafka partitions for eventlogging_QuickSurveyInitiation
- 21:13 razzi: rebalance kafka partitions for topic eventlogging_EditAttemptStep
- 19:49 mforns: finished deployment of refinery (for v0.0.146)
- 18:57 mforns: starting deployment of refinery (for v0.0.146)
- 18:54 mforns: deployed refinery-source v0.0.146 using Jenkins
- 18:45 razzi: rebalance kafka partitions for topic eqiad.mediawiki.job.ORESFetchScoreJob
- 18:42 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.ORESFetchScoreJob
- 18:22 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.wikibase-InjectRCRecords
- 17:26 razzi: rebalance kafka partitions for topic eqiad.mediawiki.revision-tags-change
- 17:26 razzi: rebalance kafka partitions for topic codfw.mediawiki.revision-tags-change
- 16:32 razzi: rebalance kafka partitions for topic eventlogging_CodeMirrorUsage
- 16:16 elukey: manual failover of hdfs namenode active/master from an-master1002 to an-master1001
2021-01-27
- 13:02 joal: Copy /wmf/data/event to backup cluster (30 TB) - T272846
- 11:15 elukey: add client_port and debug fields to X-Analytics in webrequest varnishkafka streams
2021-01-26
[edit]- 16:39 razzi: reboot kafka-test1006 for kernel upgrade
- 09:37 elukey: reboot dbstore1005 for kernel upgrades
- 09:35 joal: Copy /wmf/data/discovery to backup cluster (21 TB) - T272846
- 09:31 elukey: reboot dbstore1003 for kernel upgrades
- 09:15 elukey: reboot dbstore1004 for kernel upgrades
- 09:07 joal: Copy /wmf/refinery to backup cluster (1.1 TB) - T272846
- 09:01 joal: Copy /wmf/discovery to backup cluster (120 GB) - T272846
- 08:42 joal: Copy /wmf/camus to backup cluster (120 GB) - T272846
2021-01-25
- 20:42 razzi: rebalance kafka partitions for eqiad.mediawiki.page-properties-change.json
- 20:41 razzi: rebalance kafka partitions for codfw.mediawiki.page-properties-change
- 18:58 razzi: rebalance kafka partitions for eventlogging_ExternalGuidance
- 18:53 razzi: rebalance kafka partitions for eqiad.mediawiki.job.ChangeDeletionNotification
- 17:13 joal: Copy /user to backup cluster (92 TB) - T272846
- 16:23 elukey: drain+restart cassandra on aqs1004 to pick up the new openjdk (canary)
- 16:21 elukey: restart yarn and hdfs daemon on analytics1058 (canary node for new openjdk)
- 12:25 joal: Copy /wmf/data/archive to backup cluster (32 TB) - T272846
- 10:20 elukey: restart memcached on an-tool1010 to flush superset's cache
- 10:18 elukey: restart superset to remove druid datasources support - T263972
- 09:57 joal: Changing ownership of archive WMF files to analytics:analytics-privatedata-users after update of oozie jobs
2021-01-22
- 17:38 mforns: finished refinery deploy to HDFS
- 17:28 mforns: restarted refine_event and refine_eventlogging_legacy in an-launcher1002
- 17:11 mforns: starting refinery deploy using scap
- 17:09 mforns: bumped up refinery-source jar version to 0.0.145 in puppet for Refine and DruidLoad jobs
- 16:44 mforns: Deployed refinery-source v0.0.145 using jenkins
- 09:48 joal: Raise druid-public default replication-factor from 2 to 3
2021-01-21
- 18:54 razzi: rebooting nodes for druid public cluster via cookbook
- 16:49 ottomata: installed libsnappy-dev and python3-snappy on webperf1001
- 15:17 joal: Kill mediawiki-wikitext-history-wf-2020-12 as it was stuck and failed
- 11:19 elukey: block UA with 'python-requests.*' hitting AQS via Varnish
2021-01-20
- 21:48 milimetric: refinery deployed, synced to hdfs, ready to restart 53 oozie jobs, will do so slowly over the next few hours
- 18:11 joal: Release refinery-source v0.0.144 to archiva with Jenkins
2021-01-15
- 09:21 elukey: roll restart druid brokers on druid public - stuck after datasource drop
2021-01-11
- 07:26 elukey: execute 'sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/mediawiki' on launcher to fix dir perms
2021-01-09
- 15:11 elukey: restart timers 'analytics-*' on labstore100[6,7] to apply new permission settings
- 08:31 elukey: restart the failed hdfs rsync timers on labstore100[6,7] to kick off the remaining jobs
- 08:30 elukey: execute hdfs chmod o+x of /wmf/data/archive/projectview /wmf/data/archive/projectview/legacy /wmf/data/archive/pageview/legacy to unblock hdfs rsyncs
- 08:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/pageview" to unblock labstore hdfs rsyncs
- 08:21 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/geoeditors" to unblock labstore hdfs rsync
2021-01-08
- 18:54 joal: Restart jobs for permissions-fix (clickstream, mediacounts-archive, geoeditors-public_monthly, geoeditors-yearly, mobile_app-uniques-[daily|monthly], pageview-daily_dump, pageview-hourly, projectview-geo, unique_devices-[per_domain|per_project_family]-[daily|monthly])
- 18:14 joal: Restart projectview-hourly job (permissions test)
- 18:03 joal: Deploy refinery onto HDFS
- 17:50 joal: deploy refinery with scap
- 10:01 elukey: restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka doesn't seem to recover well
- 08:46 elukey: force restart of check_webrequest_partitions.service on an-launcher1002
- 08:44 elukey: force restart of monitor_refine_eventlogging_legacy_failure_flags.service
- 08:18 elukey: raise default max executor heap size for Spark refine to 4G
2021-01-07
- 18:22 elukey: chown -R /tmp/analytics analytics:analytics-privatedata-users (tmp dir for data quality stats tables)
- 18:21 elukey: "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/wmf/data_quality_stats"
- 18:10 elukey: temporarily disable hdfs-cleaner.timer to prevent /tmp/DataFrameToDruid from being dropped
- 18:08 elukey: chown -R /tmp/DataFrameToDruid analytics:druid (was: analytics:hdfs) on hdfs to temporarily unblock Hive2Druid jobs
- 16:31 elukey: remove /etc/mysql/conf.d/research-client.cnf from stat100x nodes
- 15:40 elukey: deprecate the 'reseachers' posix group for good
- 11:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event_sanitized" to fix some file permissions as well
- 10:36 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event" on an-master1001 to fix some file permissions (an-launcher executed timers during the past hours without the new umask) - T270629
- 09:37 elukey: forced re-run of monitor_refine_event_failure_flags.service on an-launcher1002 to clear alerts
- 08:26 joal: Rerunning 4 failed refine jobs (mediawiki_cirrussearch_request, day=6/hour=20|21, day=7/hour=0|2)
- 08:14 elukey: re-enable puppet on an-launcher1002 to apply new refine memory settings
- 07:59 elukey: re-enabling all oozie jobs previously suspended
- 07:54 elukey: restart oozie on an-coord1001
2021-01-06
- 20:42 ottomata: starting remaining refine systemd timers
- 20:19 ottomata: restarted eventlogging_to_druid timers
- 20:19 ottomata: restarted drop systemd timers
- 20:18 ottomata: restarted reportupdater timers
- 20:14 ottomata: re-starting camus systemd timers
- 16:45 razzi: restart yarn nodemanagers
- 16:08 razzi: manually failover hdfs haadmin from an-master1002 to an-master1001
- 15:53 ottomata: stopping analytics systemd timers on an-launcher1002
2021-01-05
- 21:32 ottomata: bumped mediawiki history snapshot version in AQS
- 20:45 ottomata: Refine changes: event tables now have is_wmf_domain, canary events are removed, and corrupt records will result in a better monitoring email
- 20:43 razzi: deploy aqs as part of train
- 19:17 razzi: deploying refinery for weekly train
- 09:29 joal: Manually reload unique-devices monthly in cassandra to fix T271170
2021-01-04
- 22:22 razzi: reboot an-test-coord1001 to upgrade kernel
- 14:24 elukey: deprecate the analytics-users group
2021-01-03
- 14:11 milimetric: reset-failed refinery-sqoop-whole-mediawiki.service
- 14:10 milimetric: manual sqoop finished, logs on an-launcher1002 at /var/log/refinery/sqoop-mediawiki.log and /var/log/refinery/sqoop-mediawiki-production.log
2021-01-01
- 14:54 milimetric: deployed refinery hotfix for sqoop problem, after testing on three small wikis