Analytics/Server Admin Log/Archive/2020
2020-12-29
- 09:18 elukey: restart hue to pick up analytics-hive endpoint settings
2020-12-23
- 15:53 ottomata: point analytics-hive.eqiad.wmnet back at an-coord1001 - T268028 T270768
2020-12-22
- 19:35 elukey: restart hive daemons on an-coord1001 to pick up new settings
- 18:13 elukey: failover analytics-hive.eqiad.wmnet to an-coord1002 (to allow maintenance on an-coord1001)
- 18:07 elukey: restart hive server on an-coord1002 (current standby - no traffic) to pick up the new config (use the local metastore as opposed to the one analytics-hive points to)
- 17:00 mforns: Deployed refinery as part of weekly train (v0.0.142)
- 16:42 mforns: Deployed refinery-source v0.0.142
- 15:00 razzi: stopping superset server on analytics-tool1004
- 10:36 elukey: restart presto coordinator to pick up analytics-hive settings
- 10:25 elukey: failover analytics-hive.eqiad.wmnet to an-coord1001
- 09:56 elukey: restart hive daemons on an-coord1001 to pick up analytics-hive settings
- 07:27 elukey: reboot stat100[4-8] (analytics hadoop clients) for kernel upgrades
- 07:23 elukey: move all analytics clients (spark refine, stat100x, hive-site.xml on hdfs, etc..) to analytics-hive.eqiad.wmnet
2020-12-18
- 14:10 elukey: restore stat1004 to its previous settings for kerberos credential cache
2020-12-17
- 14:54 klausman: Updated all stat100x machines to now sport kafkacat 1.6.0, backported from Bullseye
- 11:04 elukey: wipe/reimage the hadoop test cluster to start clean for CDH (and then test the upgrade to bigtop 1.5)
2020-12-16
- 21:07 joal: Kill-restart virtualpageview-hourly-coord and projectview-geo-coord with manually updated jar versions (old versions in conf)
- 19:35 joal: Kill-restart all oozie jobs belonging to analytics except mediawiki-wikitext-history-coord
- 18:52 joal: Kill-restart cassandra loading oozie jobs
- 18:37 joal: Kill-restart wikidata-entity, wikidata-item_page_link and mobile_apps-session_metrics oozie jobs
- 18:31 joal: Kill-rerun data-quality bundles
- 16:17 razzi: dropping and re-creating superset staging database
- 08:13 joal: Manually push updated pageview whitelist to HDFS
2020-12-15
- 20:24 joal: Kill restart webrequest_load oozie job after deploy
- 19:43 joal: Deploy refinery onto HDFS
- 19:14 joal: Scap deploy refinery
- 18:26 joal: Release refinery-source v0.0.141
2020-12-14
- 19:09 razzi: restart hadoop-yarn-resourcemanager on an-master1002 to promote an-master1001 to active again (active/standby check sketched below)
- 19:08 razzi: restarted hadoop-yarn-resourcemanager on an-master1001 again by mistake
- 19:02 razzi: restart hadoop-yarn-resourcemanager on an-master1002
- 18:54 razzi: restart hadoop-yarn-resourcemanager on an-master1001
- 18:43 razzi: applying yarn config change via `sudo cumin "A:hadoop-worker" "systemctl restart hadoop-yarn-nodemanager" -b 10`
- 14:58 elukey: stat1004's krb credential cache moved under /run (shared between notebooks and ssh/bash) - T255262
- 07:55 elukey: roll restart yarn daemons to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/649126
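A minimal sketch of checking which ResourceManager ended up active after the failover entries above; the HA service ids are assumptions (the real ones come from yarn.resourcemanager.ha.rm-ids in yarn-site.xml):

    # Ask each ResourceManager for its HA state; expect one "active" and one "standby".
    yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
    yarn rmadmin -getServiceState an-master1002-eqiad-wmnet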
2020-12-11
- 19:30 ottomata: now ingesting Growth EventLogging schemas using event platform refine job; they are exclude-listed from eventlogging-processor. - T267333
- 07:04 elukey: roll restart presto cluster to pick up new jvm xmx settings
- 06:57 elukey: restart presto on an-presto1003 since all the memory on the host was occupied, and puppet failed to run
2020-12-10
- 12:29 joal: Drop-Recreate-Repair wmf_raw.mediawiki_image table
2020-12-09
- 20:34 elukey: execute on mysql:an-coord1002 "set GLOBAL replicate_wild_ignore_table='superset_staging.%'" to avoid replication for superset_staging from an-coord1002
- 07:12 elukey: re-enable timers after maintenance
- 07:07 elukey: restart hive-server2 on an-coord1002 for consistency
- 07:05 elukey: restart hive metastore and server2 on an-coord1001 to pick up settings for DBTokenStore
- 06:50 elukey: stop timers on an-launcher1002 as prep step to restart hive
2020-12-07
- 18:51 joal: Test mediawiki-wikitext-history new sizing settings
- 18:43 razzi: kill testing flink job: sudo -u hdfs yarn application -kill application_1605880843685_61049
- 18:42 razzi: truncate /var/lib/hadoop/data/h/yarn/logs/application_1605880843685_61049/container_e27_1605880843685_61049_01_000002/taskmanager.log on an-worker1011
2020-12-03
- 22:34 milimetric: updated mw history snapshot on AQS
- 07:09 elukey: manual reset-failed refinery-sqoop-whole-mediawiki.service on an-launcher1002 (job launched manually)
2020-12-02
- 21:37 joal: Manually create _SUCCESS flags for banner history monthly jobs to kick off (they'll be deleted by the purge tomorrow morning; see the sketch below)
- 21:16 joal: Rerun timed out jobs after oozie config got updated (mediawiki-geoeditors-yearly-coord and banner_activity-druid-monthly-coord)
- 20:49 ottomata: deployed eventgate-analytics-external with refactored stream config, hopefully this will work around the canary events alarm bug - T266573
- 18:20 mforns: finished netflow migration wmf->event
- 17:50 mforns: starting netflow migration wmf->event
- 17:50 joal: Manually start refinery-sqoop-production on an-launcher1002 to cover for the coupled runs failure
- 16:50 mforns: restarted turnilo to clear deleted datasource
- 16:47 milimetric: faked _SUCCESS flag for image table to allow daisy-chained mediawiki history load dependent coordinators to keep running
- 07:49 elukey: restart oozie to pick up new settings for T264358
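For the 21:37 entry, creating a _SUCCESS flag by hand is just an empty-file touch on the partition directory; a sketch with an illustrative path (the real targets were the banner_activity daily partitions):

    # hdfs dfs -touchz creates a zero-byte file, which is typically what an oozie dataset done-flag checks for.
    sudo -u analytics kerberos-run-command analytics \
      hdfs dfs -touchz /wmf/data/wmf/banner_activity/daily/year=2020/month=11/_SUCCESS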
2020-12-01
- 19:43 razzi: deploy refinery with refinery-source v0.0.140
- 10:50 elukey: restart oozie to pick up new logging settings
- 09:03 elukey: clean up old hive metastore/server logs on an-coord1001 to free space
2020-11-30
- 17:51 joal: Deploy refinery onto hdfs
- 17:49 joal: Kill-restart mediawiki-history-load job after refactor (1 coordinator per table) and tables addition
- 17:32 joal: Kill-restart mediawiki-history-reduced job for druid-public datasource number of shards update
- 17:32 joal: Deploy refinery using scap for naming hotfix
- 15:29 ottomata: migrated EventLogging schemas SpecialMuteSubmit and SpecialInvestigate to EventGate - T268517
- 14:56 joal: Deploying refinery onto hdfs
- 14:49 joal: Create new hive tables for newly sqooped data
- 14:45 joal: Deploy refinery using scap
- 09:08 elukey: force execution of refinery-drop-pageview-actor-hourly-partitions on an-launcher1002 (after args fixup from Joseph)
2020-11-27
- 14:51 elukey: roll restart zookeeper on druid* nodes for openjdk upgrades
- 10:29 elukey: restart eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 (failed) to see if the hive metastore works
- 10:27 elukey: restart oozie and presto-server on an-coord1001 for openjdk upgrades
- 10:27 elukey: restart hive server and metastore on an-coord1001 - openjdk upgrades + problem with high GC caused by a job
- 08:05 elukey: roll restart druid public cluster for openjdk upgrades
2020-11-26
- 13:52 elukey: roll restart druid daemons on druid analytics to pick up new openjdk upgrades
- 13:08 elukey: force umount/mount of all /mnt/hdfs mountpoints to pick up openjdk upgrades
- 09:07 elukey: force purging https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Diego_Maradona/daily/2020110500/2020112500 from caches
- 08:40 elukey: roll restart cassandra on aqs10* for openjdk upgrades
2020-11-25
- 19:04 joal: Killing job application_1605880843685_18336 as it consumes too many resources
- 18:40 elukey: restart turnilo to pick up new netflow config changes
- 16:46 elukey: move analytics1066 to C3
- 16:11 elukey: move analytics1065 to C3
- 15:38 elukey: move stat1004 to A5
2020-11-24
- 19:33 elukey: kill and restart webrequest_load bundle to pick up analytics-hive.eqiad.wmnet settings
- 19:05 elukey: deploy refinery to hdfs (even if not really needed)
- 18:47 elukey: deploy analytics refinery as part of the regular weekly train
- 15:38 elukey: move druid1005 from rack B7 to B6
- 14:59 elukey: move analytics1072 from rack B2 to B3
- 09:16 elukey: drop principals and keytabs for analytics10[42-57] - T267932
2020-11-21
- 08:10 elukey: remove big stderr log file in /var/lib/hadoop/data/d/yarn/logs/application_1605880843685_1450 on an-worker1110
- 08:05 elukey: remove big stderr log file in /var/lib/hadoop/data/e/yarn/logs/application_1605880843685_1450 on an-worker1105
2020-11-20
- 21:09 razzi: truncate /var/lib/hadoop/data/u/yarn/logs/application_1605880843685_0581/container_e27_1605880843685_0581_01_000171/stderr logfile on an-worker1098
2020-11-19
- 16:35 elukey: roll restart hadoop workers for openjdk upgrades
- 07:07 elukey: roll restart java daemons on Hadoop test for openjdk upgrades
- 06:50 elukey: restart refinery-import-siteinfo-dumps.service on an-launcher1002
2020-11-18
- 09:22 elukey: set dns_canonicalize_hostname = false on all kerberos clients
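The setting lands in the [libdefaults] section of the puppet-managed /etc/krb5.conf on every client; a quick way to verify it on a host, assuming that layout:

    # Expect: dns_canonicalize_hostname = false
    grep -A 20 '\[libdefaults\]' /etc/krb5.conf | grep dns_canonicalize_hostname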
2020-11-17
- 23:09 mforns: restarted browser general oozie job
- 23:00 mforns: finished deploying refinery (regular weekly deployment train)
- 22:36 mforns: deploying refinery (regular weekly deployment train)
- 15:11 elukey: drop backup@localhost user from an-coord1001's mariadb meta instance (not used anymore)
- 15:09 elukey: drop 'dump' user from an-coord1001's analytics meta (related to dbprov hosts, previous attempts before db1108)
- 14:57 elukey: shutdown stat1008 for ram expansion
- 11:28 elukey: set analytics meta instance on an-coord1002 as replica of an-coord1001
2020-11-16
- 10:41 klausman: about to update stat1008 to new kernel and rocm
- 09:13 joal: Rerun webrequest-refine for hours 0 to 6 of day 2020-11-16 - This will prevent webrequest-druid-daily to get loaded with incoherent data due to bucketing change
- 08:45 joal: Correct webrequest job directly on HDFS and restart webrequest bundle oozie job
- 08:43 joal: Kill webrequest bundle to correct typo
- 08:31 joal: Restart webrequest bundle oozie job with update
- 08:25 joal: Deploying refinery onto HDFS
- 08:13 joal: Deploying refinery with scap
- 08:01 joal: Repair wmf.webrequest hive table partitions
- 08:01 joal: Recreate wmf.webrequest hive table with new partitioning
- 08:00 joal: Drop webrequest table
- 07:55 joal: Kill webrequest-bundle oozie job for table update
2020-11-15
- 08:27 elukey: truncate -s 10g /var/lib/hadoop/data/n/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000177/stderr on an-worker1100
- 08:21 elukey: sudo truncate -s 10g /var/lib/hadoop/data/c/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000019/stderr on an-worker1098
2020-11-10
- 19:32 joal: Deploy wikistats2 v2.8.2
- 18:16 joal: Releasing refinery-source v0.0.139 to archiva
- 14:48 mforns: restarted data quality stats daily bundle with new metric
- 13:30 elukey: add hive-server2 to an-coord1002
- 07:40 elukey: upgrade hue to hue_4.8.0-2 on an-tool1009
2020-11-09
- 18:34 elukey: drop hdfs-balancer multi-gb log file from an-launcher1002
- 18:33 elukey: manually start logrotate.timer, apt.timer, etc. on an-launcher1002 - they had been stopped since the last time I disabled timers
- 17:48 razzi: reboot an-coord1002 to see if it updates kernel cpu instructions
2020-11-08
- 06:31 elukey: truncate huge log file on an-worker1103 for app id application_1601916545561_147041
2020-11-06
- 19:00 mforns: launched backfilling of data quality stats for os_family_entropy_by_access_method
2020-11-05
- 18:32 razzi: shutting down kafka-jumbo1005 to allow dcops to upgrade NIC
- 17:47 razzi: shutting down kafka-jumbo1004 to allow dcops to upgrade NIC
- 16:57 razzi: shutting down kafka-jumbo1003 to allow dcops to upgrade NIC
- 16:25 razzi: shutting down kafka-jumbo1002 to allow dcops to upgrade NIC
- 14:55 elukey: shutdown kafka-jumbo1001 to swap NICs (1g -> 10g)
- 06:30 elukey: truncate application_1601916545561_129457's taskmanager.log (~600G) on an-worker1113 due to partition 'e' full
- 02:05 milimetric: deployed refinery pointing to refinery-source v0.0.138
2020-11-04
- 09:20 elukey: upgrade hue to 4.8.0 on hue-next
2020-11-03
- 16:52 elukey: mv /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets to /srv/backup/public-datasets on thorium - T265971
- 15:52 elukey: re-enable timers after maintenance
- 14:02 elukey: stop timers on an-launcher1002 to drain the cluster (an-coord1001 maintenance prep-step)
- 13:02 elukey: force a restart of performance-asoranking.service on stat1007 after fix for pandas' sort() - T266985
- 07:26 elukey: re-run cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat failed hour via hue
2020-11-02
- 21:15 ottomata: evolved Hive table event.contenttranslationabusefilter to match migrated event platform schema - T259163
- 13:40 elukey: roll restart zookeeper on an-conf* to pick up new openjdk upgrades
- 12:40 elukey: forced re-creation of base jupyterhub venvs on stat1007
2020-10-30
- 17:01 elukey: kafka preferred-replica-election on jumbo1001
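The `kafka` wrapper in the entry above fills in the cluster's zookeeper connection string; roughly, it drives the stock Kafka tool like this (the connection string shown is an assumption, not the real one):

    # Re-elect the preferred leader for every partition after a broker restart or reimage.
    kafka preferred-replica-election
    # ~ what the wrapper runs under the hood:
    # kafka-preferred-replica-election.sh --zookeeper zk-host.eqiad.wmnet:2181/kafka/jumbo-eqiad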
2020-10-29
- 14:25 elukey: restart zookeeper on an-conf1001 for openjdk upgrades
2020-10-27
- 17:38 ottomata: restrict Fuzz Faster U Fool user agents from submitting eventlogging legacy systemd data - T266130
2020-10-22
- 14:05 ottomata: bump camus version to wmf12 for all camus jobs. should be no-op now. - T251609
- 13:56 ottomata: camus-eventgate-main_events now uses EventStreamConfig to discover topics to ingest, but still uses regex to find topics to monitor - T251609
- 13:04 ottomata: camus-eventgate-analytics_events now uses EventStreamConfig to discover topics to ingest and canary topics to monitor - T251609
- 13:03 elukey: restart turnilo to pick up new wmf_netflow settings
- 11:51 ottomata: camus-eventgate-analytics-external now uses EventStreamConfig to discover topics to ingest and canary topics to monitor
- 07:03 elukey: decom analytics1057 from the Hadoop cluster
- 06:54 elukey: restart httpd on matomo1002, errors while connecting
- 06:31 elukey: restart turnilo to apply new settings for wmf_netflow
- 06:06 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics /wmf/data/archive/geoip" on an-launcher1002 - permission issues for 'analytics' and /wmf/data/archive/geoip
- 02:37 ottomata: re-run webrequest-load-wf-{text,upload}-2020-10-21-{19,20} oozie jobs after they timed out waiting for data due to camus misconfiguration (fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/635678)
2020-10-21
- 20:12 razzi: stop nginx on analytics-tool1001.eqiad.wmnet to switch to envoy (hue-next)
- 20:10 razzi: stop nginx on analytics-tool1001.eqiad.wmnet to switch to envoy (hue)
- 20:07 razzi: stop nginx on analytics-tool1007.eqiad.wmnet to switch to envoy (turnilo)
- 20:05 razzi: stop nginx on analytics-tool1004.eqiad.wmnet to switch to envoy (superset)
- 20:02 razzi: stop nginx on matomo1002.eqiad.wmnet to switch to envoy
- 10:41 elukey: decommission analytics1052 from the hadoop cluster
- 10:26 elukey: move journalnode from analytics1052 (to be decommed) to an-worker1080
2020-10-20
- 20:59 mforns: Deploying refinery with refinery-deploy-to-hdfs (for 0.0.137)
- 20:24 mforns: Deploying refinery with scap for v0.0.137
- 20:00 mforns: Deployed refinery-source v0.0.137
- 15:00 ottomata: disabling sending EventLogging events to eventlogging-valid-mixed topic - T265651
- 13:34 elukey: upgrade superset's presto TLS config after the above changes
- 13:33 elukey: move presto to puppet host TLS certificates
- 10:29 klausman: rocm38 install on an-worker1101 successful, rebooting to make sure everything is in place
- 06:41 elukey: decom analytics1056 from the hadoop cluster
2020-10-19
- 14:40 ottomata: restarted eventlogging-processor with filter to skip events already migrated to event platform - T262304
- 10:09 elukey: add pps/bps measures to wmf_netflow in turnilo
- 07:27 elukey: decom analytics1055 from the hadoop cluster
- 06:47 elukey: turnilo upgraded to 1.27.0
2020-10-18
- 07:01 elukey: decom analytics1054 from hadoop
2020-10-17
- 06:08 elukey: decom analytics1053 from the hadoop cluster
2020-10-15
- 17:57 razzi: taking yarn.wikimedia.org offline momentarily to test new tls configuration: T240439
- 14:51 elukey: roll restart druid-historical daemons on druid1004-1008 to pick up new conn pooling changes
- 07:03 elukey: restart oozie to pick up the analytics team's admin list
- 06:09 elukey: decommission analytics1050 from the hadoop cluster
2020-10-14
- 17:39 joal: Rerun refine for mediawiki_api_request failed hour
- 15:59 elukey: drain + reboot an-worker1100 to pick up GPU settings
- 15:29 elukey: drain + reboot an-worker110[1,2] to pick up GPU settings
- 14:56 elukey: drain + reboot an-worker109[8,9] to pick up GPU settings
- 05:48 elukey: decom analytics1049 from the Hadoop cluster
2020-10-13
- 12:38 elukey: drop /srv/backup/mysql from an-master1002 (not used anymore)
- 08:59 klausman: Regenned the jupyterhub venvs on stat1004
- 07:56 klausman: re-imaging stat1004 to Buster
- 06:20 elukey: decom analytics1048 from the Hadoop cluster
2020-10-12
- 11:36 joal: Clean druid test-datasources
- 11:32 elukey: remove analytics-meta lvm backup settings from an-coord1001
- 11:23 elukey: remove analytics-meta lvm backup settings from an-master1002
- 07:02 elukey: reduce hdfs block replication factor on Hadoop test to 2 (see the sketch below)
- 05:37 elukey: decom analytics1047 from the Hadoop cluster
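Lowering dfs.replication (the 07:02 entry) only affects newly written files; a sketch of bringing existing test-cluster data down to two replicas as well, assuming the same kerberos-run-command wrapper used elsewhere in this log:

    # -w waits until the target replication factor is reached; slow on large trees.
    sudo -u hdfs kerberos-run-command hdfs hdfs dfs -setrep -w 2 /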
2020-10-11
- 08:33 elukey: drop some old namenode backups under /srv on an-master1002 to free some space
- 08:24 elukey: decommission analytics1046 from the hadoop cluster
- 08:12 elukey: clean up logs on an-launcher1002 (disk space full)
2020-10-10
- 12:01 elukey: decommission analytics1045 from the Hadoop cluster
2020-10-09
- 13:17 elukey: execute "cumin 'stat100[5,8]* or an-worker109[6-9]* or an-worker110[0,1]*' 'apt-get install -y linux-headers-amd64'"
- 11:15 elukey: bootstrap the Analytics Hadoop test cluster
- 09:47 elukey: roll restart of hadoop-yarn-nodemanager on all hadoop workers to pick up new settings
- 07:58 elukey: decom analytics1044 from Hadoop
- 07:04 elukey: failover from an-master1002 to 1001 for HDFS namenode (the namenode failed over hours ago, no logs to check)
2020-10-08
- 18:08 razzi: restart oozie server on an-coord1001 for reverting T262660
- 17:42 razzi: restart oozie server on an-coord1001 for T262660
- 17:19 elukey: removed /var/lib/puppet/clientbucket/6/f/a/c/d/9/8/d/6facd98d16886787ab9656eef07d631e/content on an-launcher1002 (29G, last modified Aug 4th)
- 15:45 elukey: executed git pull on /srv/jupyterhub/deploy and re-ran create_virtualenv.sh on stat1007 (pyspark kernels may not run correctly due to a missing feature)
- 15:43 elukey: executed git pull on /srv/jupyterhub/deploy and re-ran create_virtualenv.sh on stat1006 (pyspark kernels not running due to a missing feature)
- 13:13 elukey: roll restart of druid overlords and coordinators on druid public to pick up new TLS settings
- 12:51 elukey: roll restart of druid overlords and coordinators on druid analytics to pick up new TLS settings
- 10:35 elukey: force the re-creation of default jupyterhub venvs on stat1006 after reimage
- 08:47 klausman: Starting re-image of stat1006 to Buster
- 07:14 elukey: decom analytics1043 from the Hadoop cluster
- 06:46 elukey: move the hdfs balancer from an-coord1001 to an-launcher1002
2020-10-07
- 08:45 elukey: decom analytics1042 from hadoop
2020-10-06
- 13:14 elukey: cleaned up /srv/jupyter/venv and re-created it to allow jupyterhub to start cleanly on stat1007
- 12:56 joal: Restart oozie to pick up new spark settings
- 12:47 elukey: force re-creation of the base virtualenv for jupyter on stat1007 after the reimage
- 12:20 elukey: update HDFS Namenode GC/Heap settings on an-master100[1,2]
- 12:19 elukey: increase spark shuffle io retry logic (10 tries every 10s; see the sketch below)
- 09:08 elukey: add an-worker1114 to the hadoop cluster
- 09:04 klausman: Starting reimaging of stat1007
- 07:32 elukey: bootstrap an-worker111[13] as hadoop workers
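A sketch of the Spark properties behind the 12:19 shuffle-retry entry, "10 tries every 10s" (the upstream defaults are 3 retries every 5s; the real values live in the puppet-managed spark-defaults.conf, written to /tmp here purely for illustration):

    printf 'spark.shuffle.io.maxRetries 10\nspark.shuffle.io.retryWait 10s\n' > /tmp/spark-shuffle-retry.conf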
2020-10-05
- 19:14 mforns: restarted oozie coord unique_devices-per_domain-monthly after deployment
- 19:05 mforns: finished deploying refinery to unblock deletion of raw mediawiki_job and raw netflow data
- 18:45 mforns: deploying refinery to unblock deletion of raw mediawiki_job and raw netflow data
- 18:20 elukey: manual creation of /opt/rocm -> /opt/rocm-3.3.0 on stat1008 to avoid failures in finding the lib dir
- 17:11 elukey: bootstrap an-worker[1115-1117] as hadoop workers
- 14:52 milimetric: disabling drop-el-unsanitized-events timer until https://gerrit.wikimedia.org/r/c/analytics/refinery/+/631804/ is deployed
- 14:41 elukey: shutdown stat1005 and stat1008 for ram expansion (1005 again)
- 14:25 elukey: shutdown an-master1001 for ram expansion
- 13:54 elukey: shutdown stat1005 for ram upgrade
- 13:31 elukey: shutdown an-master1002 for ram expansion (64 -> 128G)
- 12:35 elukey: execute "PURGE BINARY LOGS BEFORE '2020-09-28 00:00:00';" on an-coord1001's mysql to free space - T264081
- 10:31 elukey: bootstrap an-worker111[0,2] as hadoop workers
- 06:33 elukey: reboot stat1005 to resolve weird GPU state (scheduled last week)
2020-10-03
- 10:35 joal: Manually run mediawiki-history-denormalize after fail-rerun problem (second time)
2020-10-02
- 16:43 joal: Rerun mediawiki-history-denormalize-wf-2020-09 after failed instance
- 14:23 elukey: live patch refinery-drop-older-than on stat1007 to unblock timer (patch https://gerrit.wikimedia.org/r/6317800)
- 13:00 elukey: add an-worker110[6-9] to the Hadoop cluster
- 06:49 elukey: add an-worker110[0-2] to the hadoop cluster
- 06:33 joal: Manually sqoop page_props and user_properties to unlock mediawiki-history-load oozie job
2020-10-01
- 19:07 fdans: deploying wikistats
- 19:06 fdans: restarted banner_activity-druid-daily-coord from Sep 26
- 18:59 fdans: restarting mediawiki-history-load-coord
- 18:57 fdans: creating hive table wmf_raw.mediawiki_page_props
- 18:56 fdans: creating hive table wmf_raw.mediawiki_user_properties
- 17:40 elukey: remove + re-create /srv/deployment/analytics/refinery* on stat100[46] (perm issues after reimage)
- 17:32 elukey: remove + re-create /srv/deployment/analytics/refinery on stat1007 (perm issues after reimage)
- 17:18 fdans: deploying refinery
- 14:51 elukey: bootstrap an-worker109[8-9] as hadoop workers (with GPU)
- 13:35 elukey: bootstrap an-worker1097 (GPU node) as hadoop worker
- 13:15 elukey: restart performance-asoranking on stat1007
- 13:15 elukey: execute "sudo chown analytics-privatedata:analytics-privatedata-users /srv/published-datasets/performance/autonomoussystems/*" on stat1007 to fix a perm issue after reimage
- 10:30 elukey: add an-worker1103 to the hadoop cluster
- 07:15 elukey: restart hdfs namenodes on an-master100[1,2] to pick up new hadoop workers settings
- 06:04 elukey: execute "sudo chown -R analytics-privatedata:analytics-privatedata-users /srv/geoip/archive" on stat1007 - T264152
- 05:58 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata /wmf/data/archive/geoip" - T264152
2020-09-30
- 07:29 elukey: execute "alter table superset_production.alerts drop key ix_alerts_active;" on db1108's analytics-meta instance to fix replication after Superset upgrade - T262162
- 07:04 elukey: superset upgraded to 0.37.2 on analytics-tool1004 - T262162
- 05:47 elukey: "PURGE BINARY LOGS BEFORE '2020-09-22 00:00:00';" on an-coord1001's mariadb - T264081
2020-09-28
- 18:37 elukey: execute "PURGE BINARY LOGS BEFORE '2020-09-20 00:00:00';" on an-coord1001's mariadb as attempt to recover space
- 18:37 elukey: execute "PURGE BINARY LOGS BEFORE '2020-09-15 00:00:00';" on an-coord1001's mariadb as attempt to recover space
- 15:09 elukey: execute set global max_connections=200 on an-coord1001's mariadb (hue reporting too many conns, but in reality the fault is from superset)
- 10:02 elukey: force /srv/jupyterhub/deploy/create_virtual_env.sh on stat1007 after the reimage
- 07:58 elukey: starting the process to decom the old hadoop test cluster
2020-09-27
- 06:53 elukey: manually ran /usr/bin/find /srv/backup/hadoop/namenode -mtime +14 -delete on an-master1002 to free space on the /srv partition
2020-09-25
- 16:25 elukey: systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear alerts
- 15:52 elukey: restart hdfs namenodes to correct rack settings of the new host
- 15:42 elukey: add an-worker1096 (GPU worker) to the hadoop cluster
- 08:57 elukey: restart daemons on analytics1052 (journalnode) to verify new TLS setting simplification (no truststore config in ssl-server.xml, not needed)
- 07:18 elukey: restart datanode on analytics1044 after new datanode partition settings (one partition was missing, caught by https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647)
2020-09-24
- 13:24 elukey: moved the hadoop cluster to puppet TLS certificates
- 13:20 elukey: re-enable timers on an-launcher1002 after maintenance
- 09:51 elukey: stop all timers on an-launcher1002 to ease maintenance
- 09:41 elukey: force re-creation of jupyterhub's default venv on stat1006 after reimage
- 07:29 klausman: Starting reimaging of stat1006
- 06:48 elukey: on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/*
- 06:45 elukey: on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs/*
- 06:39 elukey: manually ran "/usr/bin/find /srv/backup/hadoop/namenode -mtime +15 -delete" on an-master1002 to free some space in the backup partition
2020-09-23
- 07:29 elukey: re-enable timers on an-launcher1002 - maintenance postponed
- 06:06 elukey: stop timers on an-launcher1002 as prep step before maintenance
2020-09-22
- 06:29 elukey: re-run webrequest-load-text 21/09T21 - failed due to sporadic hive/kerberos issue (SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA: Peer indicated failure: Failure to initialize security context)
2020-09-21
- 18:00 elukey: execute sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mgerlach/logs/* to free ~30TB of space on HDFS (Replicated)
- 17:44 elukey: restart yarn resource managers on an-master100[1,2] to pick up settings for https://gerrit.wikimedia.org/r/c/operations/puppet/+/628887
- 16:59 joal: Manually add _SUCCESS file to events to hourly-partition of page_move events so that wikidata-item_page_link job starts
- 16:21 joal: Kill restart wikidata-item_page_link-weekly-coord to not wait on missing data
- 15:45 joal: Restart wikidata-json_entity-weekly coordinator after wrong kill in new hue UI
- 15:42 joal: manually killing wikidata-json_entity-weekly-wf-2020-08-31 - Raw data is missing from dumps folder (json dumps)
2020-09-18
- 15:05 elukey: systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear icinga alarms
- 10:38 elukey: force ./create_virtualenv.sh in /srv/jupyterhub/deploy to update the jupyter's default venv
2020-09-17
- 10:12 klausman: started backup of stat1004's /srv to stat1008
2020-09-16
- 19:12 joal: Manually kill webrequest-hour oozie job that started before the restart could happen (waiting for previous hour to be finished)
- 19:00 joal: Kill-restart data-quality-hourly bundle after deploy
- 18:57 joal: Kill-restart webrequest after deploy
- 18:44 joal: Kill restart mediawiki-history-reduced job after deploy
- 17:59 joal: Deploy refinery onto HDFS
- 17:46 joal: Deploy refinery using scap
- 15:27 elukey: update the TLS backend certificate for Analytics UIs (unified one) to include hue-next.w.o as SAN
- 12:11 klausman: stat1008 updated to use rock/rocm DKMS driver and back in operation
- 11:28 klausman: starting to upgrade to rock-dkms driver on stat1008
- 08:11 elukey: superset 0.37.1 deployed to an-tool1005 (staging env)
2020-09-15
- 13:43 elukey: re-enable timers on an-launcher1002 after maintenance on an-coord1001
- 13:43 elukey: restart of hive/oozie/presto daemons on an-coord1001
- 12:30 elukey: stop timers on an-launcher1002 to drain the cluster and restart an-coord1001's daemons (hive/oozie/presto)
- 06:48 elukey: run systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002
2020-09-14
- 14:36 milimetric: deployed eventstreams with new KafkaSSE version on staging, eqiad, codfw
2020-09-11
- 15:41 milimetric: restarted data quality stats bundles
- 01:32 milimetric: deployed small fix for hql of editors_bycountry load job
- 00:46 milimetric: deployed refinery source 0.0.136, refinery, and synced to HDFS
2020-09-09
- 10:11 klausman: Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442)
- 07:25 elukey: restart varnishkafka-webrequest on cp5010 and cp5012, delivery reports errors happening since yesterday's network outage
2020-09-04
- 18:11 milimetric: aqs deploy went well! Geoeditors endpoint is live internally, data load job was successful, will submit pull request for public endpoint.
- 06:54 joal: Manually restart mediawiki-history-drop-snapshot after hive-partitions/hdfs-folders mismatch fix
- 06:08 elukey: reset-failed mediawiki-history-drop-snapshot on an-launcher1002 to clear icinga errors
- 01:52 milimetric: aborted aqs deploy due to cassandra error
2020-09-03
- 19:15 milimetric: finished deploying refinery and refinery-source, restarting jobs now
- 13:59 milimetric: edit-hourly-druid-wf-2020-08 fails consistently
- 13:56 joal: Kill-restart mediawiki-history-reduced oozie job into production queue
- 13:56 joal: rerun edit-hourly-druid-wf-2020-08 after failed attempt
2020-09-02
- 18:24 milimetric: restarting mediawiki history denormalize coordinator in production queue, due to failed 2020-08 run
- 08:37 elukey: run kafka preferred-replica-election on jumbo after jumbo1003's reimage to buster
2020-08-31
- 13:43 elukey: run kafka preferred-replica-election on Jumbo after jumbo1001's reimage
- 07:13 elukey: run kafka preferred-replica-election on Jumbo after jumbo1005's reimage
2020-08-28
- 14:25 mforns: deployed pageview whitelist with new wiki: ja.wikivoyage
- 14:18 elukey: run kafka preferred-replica-election on jumbo after the reimage of jumbo1006
- 07:21 joal: Manually add ja.wikivoyage to pageview allowlist to prevent alerts
2020-08-27
- 19:05 mforns: finished refinery deploy (ref v0.0.134)
- 18:41 mforns: starting refinery deploy (ref v0.0.134)
- 18:30 mforns: deployed refinery-source v0.0.134
- 13:29 elukey: restart jvm daemons on analytics1042, aqs1004, kafka-jumbo1001 to pick up new openjdk upgrades (canaries)
2020-08-25
- 15:47 elukey: restart mariadb@analytics_meta on db1108 to apply a replication filter (exclude superset_staging database from replication)
- 06:35 elukey: restart mediawiki-history-drop-snapshot on an-launcher1002 to check that it works
2020-08-24
- 06:50 joal: Dropping wikitext-history snapshots 2020-04 and 2020-05, keeping the two most recent (2020-06 and 2020-07), to free space in hdfs
2020-08-23
- 19:34 nuria: deleted 1.2 TB from hdfs://analytics-hadoop/user/analytics/.Trash/200811000000
- 19:31 nuria: deleted 1.2 TB from hdfs://analytics-hadoop/user/nuria/.Trash/*
- 19:26 nuria: deleted 300G from hdfs://analytics-hadoop/user/analytics/.Trash/200814000000
- 19:25 nuria: deleted 1.2 TB from hdfs://analytics-hadoop/user/analytics/.Trash/200808000000
2020-08-20
- 16:49 joal: Kill restart webrequest-load bundle to move it to production queue
2020-08-14
- 09:13 fdans: restarting refine to apply T257860
2020-08-13
- 16:13 fdans: restarting webrequest bundle
- 14:44 fdans: deploying refinery
- 14:13 fdans: updating refinery source symlinks
2020-08-11
- 17:36 ottomata: refine with refinery-source 0.0.132 and merge_with_hive_schema_before_read=true - T255818
- 14:52 ottomata: scap deploy refinery to an-launcher1002 to get camus wrapper script changes
2020-08-06
- 14:47 fdans: deploying refinery
- 08:07 elukey: roll restart druid-brokers (on both clusters) to pick up new changes for monitoring
2020-08-05
- 13:04 elukey: restart yarn resource managers on an-master100[12] to pick up new Yarn settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/618529
- 13:03 elukey: set yarn_scheduler_minimum_allocation_mb = 1 (was zero) on Hadoop to work around a Flink 1.1 issue (namely it doesn't work if the value is <= 0; see the sketch below)
- 09:32 elukey: set ticket max renewable lifetime to 7d on all kerberos clients (was zero, the default)
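A sketch of the yarn-site.xml property behind the 13:03 entry (the real file is puppet-managed; the snippet is written to /tmp purely for illustration):

    # Flink on YARN refuses to start when the minimum container allocation is <= 0.
    printf '%s\n' \
      '<property>' \
      '  <name>yarn.scheduler.minimum-allocation-mb</name>' \
      '  <value>1</value>' \
      '</property>' > /tmp/yarn-min-alloc-snippet.xml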
2020-08-04
- 08:30 elukey: resume druid-related oozie coordinator jobs via Hue (after druid upgrade)
- 08:28 elukey: started netflow kafka supervisor on Druid Analytics (after upgrade)
- 08:19 elukey: restore systemd timers for druid jobs on an-launcher1002 (after druid upgrade)
- 07:33 elukey: stop systemd timers related to druid on an-launcher1002
- 07:29 elukey: stop kafka supervisor for netflow on Druid Analytics (prep step for druid upgrade)
- 07:00 elukey: suspend all druid-related coordinators in Hue as prep step for upgrade
2020-08-03
- 09:53 elukey: move all druid-related systemd timers to spark client mode - T254493
- 08:07 elukey: roll restart aqs on aqs* to pick up new druid settings
2020-08-01
- 13:22 joal: Rerun cassandra-monthly-wf-local_group_default_T_unique_devices-2020-7 to load missing data (email with bug description sent to list)
2020-07-31
- 14:46 mforns: restarted webrequest oozie bundle
- 14:46 mforns: restarted mediawiki history reduced oozie job
- 09:00 elukey: SET GLOBAL expire_logs_days=14; on matomo1002's mysql
- 09:00 elukey: SET GLOBAL expire_logs_days=14; on an-coord1001's mysql
- 06:32 elukey: roll restart of druid brokers on druid100[4-8] to pick up new changes
2020-07-30
- 19:14 mforns: finished refinery deploy (for v0.0.132)
- 18:48 mforns: starting refinery deploy (for v0.0.132)
- 18:27 mforns: deployed refinery-source v0.0.132
2020-07-29
- 14:37 mforns: quick deployment of pageview white-list
2020-07-28
- 17:52 ottomata: stopped writing eventlogging data log files on eventlog1002 and stopped syncing them to stat100[67] - T259030
- 14:29 elukey: stop client-side-events-log.service on eventlog1002 to avoid /srv to fill up
- 09:48 elukey: re-enable eventlogging file consumers on eventlog1002
- 09:10 elukey: temporarily stop eventlogging file consumers on eventlog1002 to copy some data over to stat1005 (/srv partition full)
- 08:03 elukey: Superset migrated to CAS
- 06:42 elukey: re-run webrequest-load hour 2020-7-28-3
2020-07-27
- 17:15 elukey: restart eventlogging on eventlog1002 to update the event whitelist (exclude MobileWebUIClickTracking)
- 08:19 elukey: reset-failed the monitor_refine_failures for eventlogging on an-launcher1002
- 06:44 elukey: truncate big log file on an-launcher1002 that is filling up the /srv partition
2020-07-22
- 15:05 joal: manually drop /user/analytics/.Trash/200714000000/wmf/data/wmf/pageview/actor to free some space
- 15:03 joal: Manually drop /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2020-03 to free some space
- 15:01 elukey: hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs
- 14:49 elukey: hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics/logs/*
- 08:09 elukey: turnilo.wikimedia.org migrated to CAS
2020-07-21
- 18:30 mforns: finished re-deploying refinery to unbreak unique devices per domain monthly
- 18:05 mforns: re-deploying refinery to unbreak unique devices per domain monthly
- 17:34 mforns: restarted unique_devices-per_domain-daily-coord
- 15:09 elukey: yarn.wikimedia.org migrated earlier on to CAS auth
- 14:58 ottomata: Refine - reverted change to not merge hive schema + event schema before reading - T255818
- 13:36 ottomata: Refine no longer merges with Hive table schema when reading (except for refine_eventlogging_analytics job) - T255818
2020-07-20
- 19:56 joal: kill-restart cassandra unique-devices loading daily and monthly after deploy (2020-07-20 and 2020-07-01)
- 19:55 joal: kill-restart mediawiki-history-denormalize after deploy (2020-07-01)
- 19:55 joal: kill-restart webrequest after deploy (2020-07-20T18:00)
- 19:19 mforns: finished refinery deployment (for v0.0.131)
- 19:02 mforns: starting refinery deployment (for v0.0.131)
- 19:02 mforns: deployed refinery-source v0.0.131
- 18:16 joal: Rerun cassandra-daily-coord-local_group_default_T_unique_devices from 2020-07-15 to 2020-07-19 (both included)
- 14:50 elukey: restart superset to pick up TLS to mysql settings
- 14:18 elukey: re-enable timers on an-launcher1002
- 14:01 elukey: resume pageview-daily_dump-coord via Hue to ease the draining + mariadb restart
- 14:00 elukey: restart mariadb on an-coord1001 with TLS settings
- 13:43 elukey: suspend pageview-daily_dump-coord via Hue to ease the draining + mariadb restart
- 12:55 elukey: stop timers on an-launcher1002 to ease the mariadb restart on an-coord1001 (second attempt)
- 09:10 elukey: start timers on an-launcher1002 (no mysql restart happened, long jobs not completing, will postpone)
- 07:16 joal: Restart mobile_apps-session_metrics-wf-7-2020-7-12 after heisenbug kerberos failure
- 06:58 elukey: stop timers on an-launcher1002 to ease the mariadb restart on an-coord1001
2020-07-17
- 12:34 elukey: deprecate pivot.wikimedia.org (to ease CAS work)
2020-07-15
- 17:58 joal: Backfill cassandra unique-devices for per-project-family starting 2019-07
- 08:18 elukey: move piwik to CAS (idp.wikimedia.org)
2020-07-14
- 15:50 elukey: upgrade spark2 on all stat100x hosts
- 15:07 elukey: upgrade spark2 to 2.4.4-bin-hadoop2.6-3 on stat1004
- 14:55 elukey: re-create jupyterhub's venv on stat1005/8 after https://gerrit.wikimedia.org/r/612484
- 14:45 elukey: re-create jupyterhub's base kernel directory on stat1005 (trying to debug some problems)
- 07:27 joal: Restart forgotten unique-devices per-project-family jobs after yesterday deploy
2020-07-13
- 20:17 milimetric: deployed weekly train with two oozie job bugfixes and rename to pageview_actor table
- 19:42 joal: Deploy refinery with scap
- 19:24 joal: Drop pageview_actor_hourly and replace it by pageview_actor
- 18:26 joal: Kill pageview_actor_hourly and unique_devices_per_project_family jobs to copy backfilled data
- 12:35 joal: Start backfilling of wdqs_internal (external had been done, not internal :S)
2020-07-10
- 17:10 nuria: updating the EL whitelist, refinery deploy (but not source)
- 16:01 milimetric: deployed, EL whitelist is updated
2020-07-09
- 18:52 elukey: upgrade spark2 to 2.4.4-bin-hadoop2.6-3 on stat1008
2020-07-07
- 10:12 elukey: decom archiva1001
2020-07-06
- 08:09 elukey: roll restart aqs on aqs100[4-9] to pick up new druid settings
- 07:51 elukey: enable binlog on matomo's database on matomo1002
2020-07-04
- 10:52 joal: Rerun mediawiki-geoeditors-monthly-wf-2020-06 after heisenbug (patch provided for long-term fix)
2020-07-03
- 19:20 joal: restart failed webrequest-load job webrequest-load-wf-text-2020-7-3-17 with higher thresholds - error due to burst of requests in ulsfo
- 19:13 joal: restart mediawiki-history-denormalize oozie job using 0.0.115 refinery-job jar
- 19:05 joal: kill manual execution of mediawiki-history to save an-coord1001 (too big of a spark-driver)
- 18:53 joal: restart webrequest-load-wf-text-2020-7-3-17 after hive server failure
- 18:52 joal: restart data_quality_stats-wf-event.navigationtiming-useragent_entropy-hourly-2020-7-3-15 after hive server failure
- 18:51 joal: restart virtualpageview-hourly-wf-2020-7-3-15 after hive-server failure
- 16:41 joal: Rerun mediawiki-history-check_denormalize-wf-2020-06 after having cleaned up wrong files and restarted a job without deterministic skewed join
2020-07-02
- 18:16 joal: Launch a manual instance of mediawiki-history-denormalize to release data despite oozie failing
- 16:17 joal: rerun mediawiki-history-denormalize-wf-2020-06 after oozie sharelib bump through manual restart
- 12:41 joal: retry mediawiki-history-denormalize-wf-2020-06
- 07:26 elukey: start a tmux on an-launcher1002 with 'sudo -u analytics /usr/local/bin/kerberos-run-command analytics /usr/local/bin/refinery-sqoop-mediawiki-production'
- 07:20 elukey: execute systemctl reset-failed refinery-sqoop-whole-mediawiki.service to clear our alarms on launcher1002
2020-07-01
- 19:04 joal: Kill/restart webrequest-load-bundle for mobile-pageview update
- 18:59 joal: kill/restart pageview-druid jobs (hourly, daily, monthly) for in_content_namespace field update
- 18:57 joal: kill/restart mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord for bz2 codec update
- 18:55 joal: kill/restart mediawiki-history-denormalize-coord after skewed-join strategy update
- 18:52 joal: Kill/Restart unique_devices-per_project_family-monthly-coord after fix
- 18:41 joal: deploy refinery to HDFS
- 18:28 joal: Deploy refinery using scap after hotfix
- 18:20 joal: Deploy refinery using scap
- 16:58 joal: trying to release refinery-source 0.0.129 to archiva, version 3
- 16:51 elukey: remove /etc/maven/settings.xml from all analytics nodes that have it
2020-06-30
- 18:28 joal: trying to release refinery-source to archiva from jenkins (second time)
- 16:30 joal: Release refinery-source v0.0.129 using jenkins
- 16:30 joal: Deploy refinery
- 16:05 elukey: re-enable timers on an-launcher1002 after archiva maintenance
- 15:23 elukey: stop timers on an-launcher1002 to ease debugging for refinery deploy
- 13:12 elukey: restart nodemanager on analytics1068 after GC overhead and OOMs
- 09:32 joal: Kill/Restart mediawiki-wikitext-history job now that the current month one is done (bz2 fix)
2020-06-29
- 13:09 elukey: archiva.wikimedia.org migrated to archiva1002
2020-06-25
- 17:20 elukey: move RU jobs/timers from an-launcher1001 to an-launcher1002
- 16:07 elukey: move all timers but RU from an-launcher1001 to 1002 (puppet disabled on 1001, all timers completed)
- 12:13 elukey: reimage notebook1003/4 to debian buster as fresh start
- 09:28 joal: Kill-restart pageview-hourly to read from pageview_actor
- 09:25 joal: Kill-restart pageview_actor jobs (current+backfill) after deploy
- 09:14 joal: Deploy refinery to HDFS
- 08:56 joal: deploying refinery using scap to fix pageview_actor_hourly
- 08:02 joal: Start backfilling pageview_actor_hourly job with new patch (expected to solve heisenbug)
- 07:40 joal: Dropping refinery-camus jars from archiva up to 0.0.115
- 07:04 joal: rerun failed pageview_actor_hourly
2020-06-24
- 19:36 joal: Cleaning refinery-spark from archiva (up to 0.0.115)
- 19:28 joal: Cleaning refinery-tools from archiva (up to 0.0.115)
- 19:16 joal: Restarting unique-devices jobs to use pageview_actor_hourly instead of webrequest (4 jobs)
- 19:08 joal: Start pageview_actor_hourly oozie job
- 19:06 joal: Create pageview_actor_hourly after deploy to start new jobs
- 18:57 joal: Clean archiva refinery-camus except 0.0.90
- 18:54 joal: Deploying refinery onto HDFS
- 18:47 joal: clean archiva from refinery-hive (up to 0.0.115)
- 18:47 joal: Deploying refinery using scap
- 18:15 joal: launching a new jenkins release after cleanup
- 17:43 joal: Resetting refinery-source to v0.0.128 for clean release after jenkins-archiva password fix
- 16:20 joal: Releasing refinery-source 0.0.128 to archiva
- 06:50 elukey: truncate /srv/reportupdater/log/reportupdater-ee-beta-features from 43G to 1G on an-launcher1001 (disk space issues)
2020-06-22
- 18:50 joal: Manually update pageview whitelist adding shnwiktionary
2020-06-20
- 07:41 elukey: powercycle an-worker1093 - soft lockup CPU bug showed in mgmt console
- 07:37 elukey: powercycle an-worker1091 - soft lockup CPU bug showed in mgmt console
2020-06-17
- 19:59 milimetric: deployed quick fix for data stats job
- 18:04 elukey: decommission matomo1001
- 16:57 ottomata: produce searchsatisfaction events on group0 wikis via eventgate - T249261
- 07:17 joal: Deleting mediawiki-history-text (avro) for 2020-01 and 2020-02 (we still have 2020-03 and 2020-04) - Expected free space: 160Tb
- 06:40 elukey: reboot krb1001 for kernel upgrades
- 06:24 elukey: reboot an-master100[1,2] for kernel upgrades
- 06:03 elukey: reboot an-conf100[1-3] for kernel upgrades
- 05:45 elukey: reboot stat1007/8 for kernel upgrades
2020-06-16
- 19:58 ottomata: evolving event.SearchSatisfaction Hive table using /analytics/legacy/searchsatisfaction/latest schema
- 19:41 ottomata: bumping Refine refinery jar version to 0.0.127 - T238230
- 19:17 ottomata: deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261
- 16:02 elukey: reboot kafka-jumbo1008 for kernel upgrades
- 15:33 milimetric: refinery deployed and synced to hdfs, with refinery-source at 0.0.126
- 15:20 elukey: reboot kafka-jumbo1007 for kernel upgrades
- 15:13 elukey: re-enabling timers on launcher after maintenance
- 15:06 elukey: reboot an-coord1001 for kernel upgrades
- 14:27 elukey: stop timers on an-launcher1001, prep before rebooting an-coord1001
- 14:23 elukey: reboot druid100[7,8] for kernel upgrades
- 11:51 elukey: re-run webrequest-druid-hourly-coord 16/06T10
- 11:36 elukey: reboot an-druid100[1,2] for kernel upgrades
2020-06-15
- 09:37 elukey: restart refinery-druid-drop-public-snapshots.service after change in vlan firewall rules (added druid100[7,8] to term druid)
2020-06-11
- 15:01 mforns: started refinery deploy for v0.0.126
- 14:58 mforns: deployed refinery-source v0.0.126
- 13:57 ottomata: removed accidentally added page_restrictions column(s) on Hive table event.mediawiki_user_blocks_change after an incorrect schema change was merged (no data was ever set in this column)
2020-06-09
- 07:32 elukey: upgrade ROCm to 3.3 on stat1005
2020-06-08
- 15:42 elukey: remove access to notebook100[3,4] - T249752
- 14:07 elukey: move matomo cron archiver to systemd timer archiver (with nagios alarming)
- 14:02 elukey: re-enable timers on an-coord1001
- 14:01 elukey: restart hive/oozie on an-coord1001 for openjdk upgrades
- 13:42 elukey: roll restart kafka jumbo brokers for openjdk upgrades
- 13:26 elukey: stop timers on an-launcher to drain jobs and restart hive/oozie for openjdk upgrades
2020-06-05
- 17:56 elukey: roll restart presto server on an-presto* to pick up new openjdk upgrades
- 16:45 elukey: upgrade turnilo to 1.24.0
- 13:26 elukey: reimage druid1006 to debian buster
- 09:26 elukey: roll restart cassandra on AQS to pick up openjdk upgrades
2020-06-04
- 19:12 elukey: roll restart of aqs to pick up new druid settings
- 18:39 mforns: deployed wikistats2 2.7.5
- 13:33 elukey: re-enable netflow hive2druid jobs after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/602356/
- 10:56 elukey: depooled and reimage druid1004 to Debian Buster (Druid public cluster)
- 07:31 elukey: stop netflow hive2druid timers to do some experiments
- 06:13 elukey: kill application_1589903254658_75731 (druid indexation for netflow still running since 12h ago)
- 05:36 elukey: restart druid middlemanager on druid1002 - strange protobuf warnings, netflow hive2druid indexation job stuck for hours
- 05:13 elukey: reimage druid1003 to Buster
2020-06-03
- 17:10 elukey: restart RU jobs after adding memory to an-launcher1001
- 16:57 elukey: reboot an-launcher1001 to get new memory
- 16:01 elukey: stop timers on an-launcher, prep for reboot
- 09:35 elukey: re-run webrequest-druid-hourly-coord 03/06T7 (failed due to druid1002 moving to buster)
- 08:50 elukey: reimage druid1002 to Buster
2020-06-01
- 14:54 elukey: stop all timers on an-launcher1001, prep step for reboot
- 12:54 elukey: drop /user/dedcode/.Trash/* (-skipTrash)
- 06:53 elukey: re-run virtualpageview-hourly-wf-2020-5-31-19
- 06:28 elukey: temporary stop of all RU jobs on an-launcher1001 to prioritize camus and others
- 06:03 elukey: kill all airflow-related processes on an-launcher1001 - host killing tasks due to OOM
2020-05-30
- 08:15 elukey: manual reset-failed of monitor_refine_mediawiki_job_events_failure_flags
2020-05-29
- 13:19 elukey: re-run druid webrequest hourly 29/05T11 (failed due to a host reimage in progress)
- 12:19 elukey: reimage druid1001 to Debian Buster
- 10:05 elukey: move el2druid config from druid1001 to an-druid1001
2020-05-28
- 18:31 milimetric: after deployment, restarted four oozie jobs with new SLAs and fixed datasets definitions
- 06:40 elukey: slowly restarting all RU units on an-launcher1001
- 06:32 elukey: delete old RU pid files with timestamp May 27 19:00 (scap deployment failed to an-launcher due to disk issues) except ./jobs/reportupdater-queries/pingback/.reportupdater.pid that was working fine
2020-05-27
- 19:53 joal: Start pageview-complete dump oozie job after deploy
- 19:24 joal: Deploy refinery onto hdfs
- 19:22 joal: restart failed services on an-launcher1001
- 19:06 joal: Deploy refinery using scap to an-launcher1001 only
- 18:41 joal: Deploying refinery with scap
- 13:42 ottomata: increased Kafka topic retention in jumbo-eqiad to 31 days for (eqiad|codfw).mediawiki.revision-create - T253753 (see the sketch below)
- 07:09 joal: Rerun webrequest-druid-hourly-wf-2020-5-26-17
- 07:04 elukey: matomo upgraded to 3.13.5 on matomo1001
- 06:17 elukey: superset upgraded to 0.36
- 05:52 elukey: attempt to upgrade Superset to 0.36 - downtime expected
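Roughly what the 13:42 retention bump amounts to per topic with the stock Kafka tooling of that era (31 days = 2678400000 ms; the zookeeper connection string is an assumption):

    for topic in eqiad.mediawiki.revision-create codfw.mediawiki.revision-create; do
      kafka-configs.sh --zookeeper zk-host.eqiad.wmnet:2181/kafka/jumbo-eqiad \
        --alter --entity-type topics --entity-name "$topic" \
        --add-config retention.ms=2678400000
    done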
2020-05-24
- 10:04 elukey: re-run virtualpageview-hourly 23/05T15 - failed due to a sporadic kerberos/hive issue
2020-05-22
- 09:11 elukey: superset upgrade attempt to 0.36 failed due to a db upgrade error (not seen in staging), rollback to 0.35.2
- 08:15 elukey: superset down for maintenance
- 07:09 elukey: add druid100[7,8] to the LVS druid-public-brokers service (serving AQS's traffic)
2020-05-21
- 17:24 elukey: add druid100[7,8] to the druid public cluster (not serving load balancer traffic for the moment, only joining the cluster) - T252771
- 16:44 elukey: roll restart druid historical nodes on druid100[4-6] (public cluster) to pick up new settings - T252771
- 14:02 elukey: restart druid kafka supervisor for wmf_netflow after maintenance
- 13:53 elukey: restart druid-historical on an-druid100[1,2] to pick up new settings
- 13:17 elukey: kill wmf_netflow druid supervisor for maintenance
- 13:13 elukey: stop druid-daemons on druid100[1-3] (one at a time) to move the druid partition from /srv/druid to /srv (didn't think about it before) - T252771
- 09:16 elukey: move Druid Analytics SQL in Superset to druid://an-druid1001.eqiad.wmnet:8082/druid/v2/sql/
- 09:05 elukey: move turnilo to an-druid1001 (beefier host)
- 08:15 elukey: roll restart of all druid historicals in the analytics cluster to pick up new settings
2020-05-20
- 13:55 milimetric: deployed refinery with refinery-source v0.0.125
2020-05-19
- 15:28 elukey: restart hadoop master daemons on an-master100[1,2] for openjdk upgrades
- 06:29 elukey: roll restart zookeeper on druid100[4-6] for openjdk upgrades
- 06:18 elukey: roll restart zookeeper on druid100[1-3] for openjdk upgrades
2020-05-18
- 14:02 elukey: roll restart of hadoop daemons on the prod cluster for openjdk upgrades
- 13:30 elukey: roll restart hadoop daemons on the test cluster for openjdk upgrades
- 10:33 elukey: add an-druid100[1,2] to the Druid Analytics cluster
2020-05-15
- 13:23 elukey: roll restart of the Druid analytics cluster to pick up new openjdk + /srv completed
- 13:15 elukey: turnilo back to druid1001
- 13:03 elukey: move turnilo config to druid1002 to ease druid maintenance
- 12:31 elukey: move superset config to druid1002 (was druid1003) to ease maintenance
- 09:08 elukey: restart druid brokers on Analytics Public
2020-05-14
- 18:41 ottomata: fixed TLS authentication for Kafka mirror maker on jumbo - T250250
- 12:49 joal: Release 2020-04 mediawiki_history_reduced to public druid for AQS (elukey did it :-P)
- 09:53 elukey: upgrade matomo to 3.13.3
- 09:50 elukey: set matomo in maintenance mode as prep step for upgrade
2020-05-13
- 21:36 elukey: powercycle analytics1055
- 13:46 elukey: upgrade spark2 on all stat100x hosts - T250161
- 06:47 elukey: upgrade spark2 on stat1004 - canary host - T250161
2020-05-11
- 10:17 elukey: re-run webrequest-load-wf-text-2020-5-11-9
- 06:06 elukey: restart wikimedia-discovery-golden on stat1007 - apparently killed because no memory was left to allocate on the system
- 05:14 elukey: force re-run of monitor_refine_event_failure_flags after fixing a refine failed hour
2020-05-10
- 07:44 joal: Rerun webrequest-load-wf-upload-2020-5-10-1
2020-05-08
- 21:06 ottomata: running preferred replica election for kafka-jumbo to get preferred leaders back after reboot of broker earlier today - T252203
- 15:36 ottomata: starting kafka broker on kafka-jumbo1006, same issue on other brokers when they are leaders of offending partitions - T252203
- 15:27 ottomata: stopping kafka broker on kafka-jumbo1006 to investigate camus import failures - T252203
- 15:16 ottomata: restarted turnilo after applying nuria and mforns changes
2020-05-07
- 17:39 ottomata: deploying fix to refinery bin/camus CamusPartitionChecker when using dynamic stream configs
- 16:49 joal: Restart and babysit mediawiki-history-denormalize-wf-2020-04
- 16:37 elukey: roll restart of all the nodemanagers on the hadoop cluster to pick up new jvm settings
- 13:53 elukey: move stat1007 to role::statistics::explorer (adding jupyterhub)
- 11:00 joal: Moving application_1583418280867_334532 to the nice queue
- 10:58 joal: Rerun wikidata-articleplaceholder_metrics-wf-2020-5-6
- 07:45 elukey: re-run mediawiki-history-denormalize
- 07:43 elukey: kill application_1583418280867_333560 after a chat with David, the job is consuming ~2TB of RAM
- 07:32 elukey: re-run mediawiki history load
- 07:18 elukey: execute yarn application -movetoqueue application_1583418280867_332862 -queue root.nice
- 07:06 elukey: restart mediawiki-history-load via hue
- 06:41 elukey: restart oozie on an-coord1001
- 05:46 elukey: re-run mediarequest-hourly-wf-2020-5-6-19
- 05:35 elukey: re-run two failed hours for webrequest load text (07/05T05) and upload (06/05T23)
- 05:33 elukey: restart hadoop yarn nodemanager on analytics1071
2020-05-06
- 12:49 elukey: restart oozie on an-coord1001 to pick up the new shlib retention changes
- 12:28 mforns: re-run pageview-druid-hourly-coord for 2020-05-06T06:00:00 after oozie shared lib update
- 11:30 elukey: use /run/user as kerberos credential cache for stat1005
- 09:25 elukey: re-run projectview coordinator for 2020-5-6-5 after oozie shared lib update
- 09:24 elukey: re-run virtualpageview coordinator for 2020-5-6-5 after oozie shared lib update
- 09:13 elukey: re-run apis coordinator for 2020-5-6-7 after oozie shared lib update
- 09:11 elukey: re-run learning features actor coordinator for 2020-5-6-7 after oozie shared lib update
- 09:10 elukey: re-run aqs-hourly coordinator for 2020-5-6-7 after oozie shared lib update
- 09:09 elukey: re-run mediacounts coordinator for 2020-5-6-7 after oozie shared lib update
- 09:08 elukey: re-run mediarequest coordinator for 2020-5-6-7 after oozie shared lib update
- 09:08 elukey: re-run data quality coordinators for 2020-5-6-5/6 after oozie shared lib update
- 09:05 elukey: re-run pageview-hourly coordinator 2020-5-6-6 after oozie shared lib update
- 09:04 elukey: execute oozie admin -sharelibupdate on an-coord1001
- 06:05 elukey: execute hdfs dfs -chown -R analytics-search:analytics-search-users /wmf/data/discovery/search_satisfaction/daily/year=2019
2020-05-05
- 19:49 mforns: Finished re-deploying refinery using scap, then re-deploying onto hdfs
- 18:47 mforns: Finished deploying refinery using scap, then deploying onto hdfs
- 18:13 mforns: Deploying refinery using scap, then deploying onto hdfs
- 18:02 mforns: Deployed refinery-source using the awesome new jenkins jobs :]
- 13:15 joal: Dropping unavailable mediawiki-history-reduced datasources from superset
2020-05-04
- 17:08 joal: Restart refinery-sqoop-mediawiki-private.service on an-launcher1001
- 17:03 elukey: restart refinery-drop-webrequest-refined-partitions after manual chown
- 17:03 joal: Restart refinery-sqoop-whole-mediawiki.service on an-launcher1001
- 17:02 elukey: chown analytics (was: hdfs) /wmf/data/wmf/webrequest/webrequest_source=text/year=2019/month=12/day=14/hour={13,18}
- 16:44 joal: Deploy refinery again using scap (trying to fix sqoop)
- 15:39 joal: restart refinery-sqoop-whole-mediawiki.service
- 15:37 joal: restart refinery-sqoop-mediawiki-private.service
- 14:50 joal: Deploy refinery using scap to fix sqoop
- 13:43 elukey: restart refinery-sqoop-whole-mediawiki to test failure exit codes
- 06:50 elukey: upgrade druid-exporter on all druid nodes
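The sqoop restarts above are ordinary systemd unit operations on an-launcher1001; a minimal sketch, including how a failed run would typically be checked:
  sudo systemctl restart refinery-sqoop-whole-mediawiki.service
  sudo systemctl status refinery-sqoop-whole-mediawiki.service
  sudo journalctl -u refinery-sqoop-whole-mediawiki.service --since "1 hour ago"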
2020-05-03
[edit]- 19:36 joal: Rerun mobile_apps-session_metrics-wf-7-2020-4-26
2020-05-02
[edit]- 10:54 joal: Rerun predictions-actor-hourly-wf-2020-5-2-0
2020-05-01
[edit]- 16:59 elukey: test prometheus-druid-exporter 0.8 on druid1001 (deb packages not yet uploaded, just build and manually installed)
2020-04-30
[edit]- 10:36 elukey: run superset init to add missing perms on an-tool1005 and analytics-tool1004 - T249681
- 07:14 elukey: correct X-Forwarded-Proto for superset (http -> https) and restart it
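The 10:36 entry uses Superset's own CLI to re-create missing default roles and permissions; a sketch, with the virtualenv path and service user assumed:
  # on the Superset host, using the deployment virtualenv (path and user assumed)
  sudo -u superset /srv/deployment/analytics/superset/venv/bin/superset init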
2020-04-29
[edit]- 18:55 joal: Kill-restart cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat
- 18:46 joal: Kill-restart pageview-hourly job
- 18:45 joal: No restart needed for pageview-druid jobs
- 18:36 joal: kill restart pageview-druid jobs (hourly, daily, monthly) to add new dimension
- 18:29 joal: Kill-restart data-quality-stats-hourly bundle
- 17:57 joal: Deploy refinery on HDFS
- 17:45 elukey: roll restart Presto workers to pick up the new jvm settings (110G heap size)
- 16:06 joal: Deploying refinery using scap
- 15:57 joal: Deploying AQS using scap
- 14:26 elukey: enable TLS consumer/producers for kafka main -> jumbo mirror maker - T250250
- 13:48 joal: Releasing refinery 0.0.123 onto archiva with Jenkins
- 08:47 elukey: roll restart zookeeper on an-conf* to pick up new openjdk11 updates (affects hadoop)
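"Kill-restart" in the entries above means killing the running Oozie coordinator and re-submitting it so that the newly deployed refinery code is used; a hedged sketch of the generic pattern (job id, properties file and property name are placeholders/assumptions):
  oozie job -kill 0012345-200401000000000-oozie-oozi-C        # placeholder coordinator id
  oozie job -run -config pageview-hourly.properties \
    -D refinery_directory=hdfs://analytics-hadoop/wmf/refinery/current   # property name assumed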
2020-04-27
[edit]- 13:02 elukey: superset 0.36.0 deployed to an-tool1005
2020-04-26
[edit]- 18:14 elukey: restart nodemanager on analytics1054 - failed due to heap pressure
- 18:14 elukey: re-run webrequest-load-coord-text 26/04/2020T16 via Hue
2020-04-23
[edit]- 13:57 elukey: launch again data quality stats bundle with https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/592008/ applied locally
2020-04-22
[edit]- 06:46 elukey: kill dataquality hourly bundle again, traffic_by_country keeps failing
- 06:11 elukey: start data quality bundle hourly with --user=analytics
- 05:45 elukey: add a separate refinery scap target for the Hadoop test cluster and redeploy to check new settings
2020-04-21
[edit]- 23:17 milimetric: restarted webrequest bundle, babysitting that first before going on
- 23:00 milimetric: forgot a small jar version update, finished deploying now
- 21:38 milimetric: deployed twice because analytics1030 failed with "OSError {}" but seems ok after the second deploy
- 14:27 elukey: add motd to notebook100[3,4] to alert about host deprecation (in favor of stat100x)
- 11:51 elukey: manually add SUCCESS flags under /wmf/data/wmf/banner_activity/daily/year=2020/month=1 and /wmf/data/wmf/banner_activity/daily/year=2019/month=12 to unblock druid banner monthly indexations
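The 11:51 entry adds the empty marker files the Druid banner monthly indexations wait for; a sketch using the paths from the entry (whether the flag lives at month level or per-day partition depends on the dataset definition):
  hdfs dfs -touchz /wmf/data/wmf/banner_activity/daily/year=2020/month=1/_SUCCESS
  hdfs dfs -touchz /wmf/data/wmf/banner_activity/daily/year=2019/month=12/_SUCCESS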
2020-04-20
[edit]- 14:38 ottomata: restarting eventlogging-processor with updated python3-ua-parser for parsing KaiOS user agents
- 10:28 elukey: drop /srv/log/mw-log/archive/api from stat1007 (freeing 1.3TB of space!)
2020-04-18
[edit]- 21:40 elukey: force hdfs-balancer as attempt to redistribute hdfs blocks more evenly to worker nodes (hoping to free the busiest ones)
- 21:32 elukey: drop /user/analytics-privatedata/.Trash/* from hdfs to free some space (~100G used)
- 21:25 elukey: drop /var/log/hadoop-yarn/apps/analytics-search/* from hdfs to free space (~8T replicated used)
- 21:21 elukey: drop /user/{analytics|hdfs}/.Trash/* from hdfs to free space (~100T used)
- 21:12 elukey: drop /var/log/hadoop-yarn/apps/analytics from hdfs to free space (15.1T replicated)
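A sketch of the space-freeing steps behind the 21:12-21:40 entries (paths from the log; -skipTrash avoids re-filling the trash that is being emptied, and the balancer threshold value is an assumption):
  hdfs dfs -du -s -h /user/analytics/.Trash /user/hdfs/.Trash      # check how much would be freed
  hdfs dfs -rm -r -skipTrash '/user/analytics/.Trash/*' '/user/hdfs/.Trash/*'
  hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics
  sudo -u hdfs hdfs balancer -threshold 10                         # redistribute blocks across datanodes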
2020-04-17
[edit]- 13:45 elukey: lock down /srv/log/mw-log/archive/ on stat1007 to analytics-privatedata-users access only
- 10:26 elukey: re-created default venv for notebooks on notebook100[3,4] (forgot to git pull before re-creating it the last time)
2020-04-16
[edit]- 05:34 elukey: restart hadoop-yarn-nodemanager on an-worker108[4,5] - failed after GC OOM events (heavy spark jobs)
2020-04-15
[edit]- 14:03 elukey: update Superset Alpha role perms with what is stated in T249923#6058862
- 09:35 elukey: restart jupyterhub too as a follow-up
- 09:35 elukey: execute "create_virtualenv.sh ../venv" on stat1006, notebook1003, notebook1004 to apply new settings to Spark kernels (re-creating them)
- 09:09 elukey: restart druid brokers on druid100[4-6] - stuck after datasource deletion
2020-04-11
[edit]- 09:19 elukey: set hive-security: read-only for the Presto hive connector and roll restart the cluster
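The hive-security setting maps to the Presto Hive connector's catalog configuration; a minimal sketch of the relevant lines (catalog file name assumed):
  # /etc/presto/catalog/analytics_hive.properties (file name assumed)
  connector.name=hive-hadoop2
  hive.security=read-only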
2020-04-10
[edit]- 16:31 elukey: enable TLS from kafkatee to Kafka on analytics1030 (test instance)
- 15:45 elukey: migrate data_purge timers from an-coord1001 to an-launcher1001
- 09:11 elukey: move druid_load jobs from an-coord1001 to an-launcher1001
- 08:08 elukey: move project_namespace_map from an-coord1001 to an-launcher1001
- 07:38 elukey: move hdfs-cleaner from an-coord1001 to an-launcher1001
2020-04-09
[edit]- 20:54 elukey: re-run webrequest upload/text hour 15:00 from Hue (stuck due to missing _IMPORTED flag, caused by the an-launcher1001 migration; Andrew fixed it by manually re-running the Camus checker)
- 16:00 elukey: move camus timers from an-coord1001 to an-launcher1001
- 15:20 elukey: absent spark refine timers on an-coord1001 and move them to an-launcher1001
2020-04-07
[edit]- 09:17 elukey: enable refine for TwoColConflictExit (EL schema)
2020-04-06
[edit]- 13:23 elukey: upgraded stat1008 to AMD ROCm 3.3 (enables tensorflow 2.x)
- 12:33 joal: Bump AQS druid backend to 2020-03
- 11:50 elukey: deploy new druid datasource in Druid public
- 06:29 elukey: allow all analytics-privatedata-users to use the GPUs on stat1005/8
2020-04-04
[edit]- 06:52 elukey: restart refinery-import-page-history-dumps
2020-04-03
[edit]- 09:57 elukey: remove TwoColConflictExit from eventlogging's refine blacklist
2020-04-02
[edit]- 19:31 joal: restart pageview-hourly job after manual patch
- 19:29 joal: Manually patching last deploy to fix virtualpageview job - code merged
- 17:48 joal: Kill/restart virtualpageview-hourly-coord after deploy
- 16:55 joal: Deploy refinery onto HDFS
- 16:30 joal: Deploy refinery using scap
- 16:12 elukey: re-enable timers on an-coord1001 after maintenance
- 15:52 elukey: restart hive server2/metastore with G1 settings
- 14:05 elukey: temporary stop timers on an-coord1001 to facilitate hive daemons restarts
- 13:47 hashar: test 1 2 3
- 13:30 joal: Releasing refinery-source v0.0.121 using new jenkins-docker :)
- 08:23 elukey: kill/restart netflow realtime druid indexation with a new dimension (peer_ip_src) - T246186
2020-04-01
[edit]- 21:19 joal: restart pageview-hourly-wf-2020-4-1-15
- 18:24 joal: Kill learning-features-actor-hourly as new version to come
- 18:23 joal: Restart unique_devices-per_project_family-monthly-wf-2020-3 and aqs-hourly-wf-2020-4-1-15 after hive failure
- 18:21 joal: restart webrequest-load-wf-upload-2020-4-1-16 and webrequest-load-wf-text-2020-4-1-16 after hive failure
- 18:14 joal: Kill groceryheist job taking half the cluster
- 18:06 ottomata: restarted hive-server2
- 10:07 jbond42: updating icu packages
2020-03-31
[edit]- 12:57 jbond42: updating icu on presto-analytics-canary and hadoop-worker-canary
2020-03-30
[edit]- 07:27 elukey: run /usr/local/bin/refine_sanitize_eventlogging_analytics_immediate --ignore_failure_flag=true --since=72 --verbose --table_whitelist_regex="ResourceTiming" refine_sanitize_eventlogging_analytics_immediate to fix _REFINE_FAILED events
- 07:16 elukey: run eventlogging refine manually for schemas "EditorActivation|EditorJourney|HomepageVisit|VisualEditorFeatureUse|WikibaseTermboxInteraction|UploadWizardErrorFlowEvent|MobileWikiAppiOSReadingLists|ContentTranslationCTA|QuickSurveysResponses|MobileWikiAppiOSSessions" to fix _REFINE_FAILED events
2020-03-29
[edit]- 08:44 elukey: blacklist TwoColConflictExit from Eventlogging Refine to avoid alarm spam
2020-03-28
[edit]- 16:54 elukey: restart yarn nodemanager on analytics1071 - network errors in the logs
2020-03-27
[edit]- 08:09 elukey: deployed new kernels for https://gerrit.wikimedia.org/r/580083 on stat1004
2020-03-26
[edit]- 09:09 elukey: re-running manually webrequest-load upload 26/03/2020T08 - kerberos failures
2020-03-25
[edit]- 08:14 elukey: restart presto-server on an-coord1001 to remove jmx catalog config
2020-03-24
[edit]- 15:46 elukey: restart all cron.service processes on stat/notebook (killing long lingering processes) to move the unit under user.slice
2020-03-21
[edit]- 14:17 joal: Restart wikidata_item_page_link job with manual fix - review to be confirmed
- 14:06 joal: Kill buggy wikidata_item_page_link job
2020-03-18
[edit]- 19:39 fdans: refinery deployed
- 18:52 fdans: deploying refinery
- 18:51 fdans: refinery source 0.0.119 jars generated and symlinked
- 18:17 fdans: beginning deploy of refinery-source 0.0.119
2020-03-17
[edit]- 17:25 elukey: deploy superset to enable Presto and Kerberos (PyHive 0.6.2)
2020-03-16
[edit]- 19:43 joal: Kill-restart wikidata-articleplaceholder_metrics-coord to fix yarn queue
- 18:30 mforns: Deployed refinery using scap, then deployed onto hdfs
- 17:05 elukey: roll restart of hadoop namenodes to get the new GC setting (MaxGCPauseMillis 400 -> 1000)
2020-03-13
[edit]- 12:18 joal: Restart cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2020-3-12
2020-03-12
[edit]- 22:53 mforns: Deployed refinery using scap, then deployed onto hdfs
- 22:22 mforns: deployed refinery-source using jenkins
- 11:09 elukey: roll restart kerberos kdcs to pick up new ticket lifetime settings (10h -> 48h)
- 08:27 elukey: re-running refine eventlogging with --since 12 (very conservative but just in case)
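The KDC roll restart above follows a change of the maximum ticket lifetime from 10h to 48h; a hedged sketch of the settings involved (Debian default paths, realm name omitted):
  # /etc/krb5kdc/kdc.conf, in the realm stanza: raise the maximum ticket life
  max_life = 48h 0m 0s
  # /etc/krb5.conf, [libdefaults]: have clients request the longer lifetime by default
  ticket_lifetime = 48h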
2020-03-11
[edit]- 14:49 elukey: add xmldumps mountpoints on stat1004 and stat1005
2020-03-10
[edit]- 15:20 elukey: remove the analytics user keytab from stat100[4,5]
- 15:06 elukey: move stat1006 to role::statistics::explorer
- 09:24 elukey: removed /etc/mysql/conf.d/stats-research-client.cnf from all stat boxes (old file used for reportupdater, now on an-launcher1001)
2020-03-09
[edit]- 07:27 elukey: deploy jupyterhub on notebook100[3,4] (manual venv re-creation) to allow the use of the user.slice - T247055
- 07:26 elukey: upgrade nodejs from 6->10 on stat1* and notebook1*
2020-03-08
[edit]- 17:58 elukey: restart hadoop-yarn-nodemanager on an-worker1087
2020-03-06
[edit]- 14:58 joal: AQS new druid snapshot released (2020-02)
- 10:06 elukey: roll restart Presto daemons for openjdk upgrades
- 09:45 elukey: roll restart of cassandra on AQS to pick up new openjdk upgrades
2020-03-05
[edit]- 19:45 elukey: deleted dangling 'reports' symlink on stat100[6,7] in /srv/published
- 19:39 elukey: mv /srv/reportupdater to /srv/reportupdater-backup05032020 on stat100[6,7]
- 16:34 mforns: restart turnilo to refresh deleted datasources
- 14:16 elukey: restart hdfs/yarn master daemons to pick up new core-site changes for Superset
- 06:48 elukey: restart yarn on analytics1074 (GC overhead, traces of network errors with datanodes)
2020-03-04
[edit]- 08:41 joal: Kill-restart mediawiki-history-reduced-coord
- 08:38 joal: Kill-restart mediawiki-history-dumps-coord
2020-03-03
[edit]- 21:19 joal: Kill-restart actor jobs
- 21:17 joal: kill-restart mediawiki-history-check_denormalize-coord
- 21:16 joal: Kill-restart mediawiki-history job
- 21:10 joal: Kill Wikidataplaceholder failing coord
- 21:08 joal: Kill restart wikidata-specialentitydata_metrics-coord
- 21:07 joal: Start Wikidataplaceholder job
- 21:06 joal: Kill/restart edit_hourly job
- 21:04 joal: Start wikidata_item_page_link coordinator
- 20:46 joal: Deploy refinery onto HDFS
- 20:34 joal: Deploy refinery using scap
- 20:28 joal: Add new jars to refinery using Jenkins
- 20:01 joal: Release refinery-source v0.0.117 with Jenkins
- 16:37 mforns: restarted turnilo to refresh deleted test datasource
- 11:56 joal: Kill actor-hourly oozie test jobs (clarification of previous message)
- 11:55 joal: Kill actor-hourly tests
- 10:50 elukey: restarted kafka jumbo (kafka + mirror maker) for openjdk upgrades
- 09:22 joal: Rerunning failed mediawiki-history jobs for 2020-02 after mediawiki-history-denormalize issue
- 09:16 joal: Manually restarting mediawiki-history-denormalize with new patch to try
- 08:36 elukey: roll restart kafka-jumbo for openjdk upgrades
- 08:34 elukey: re-enable timers on an-coord1001 after maintenance
- 08:30 joal: Correct previous message: Kill mediawiki-history (not mediawiki-history-reduced) as it is failing
- 08:30 joal: Kill mediawiki-history-reduced as it is failing
- 08:22 elukey: hive metastore/server2 now running without zookeeper settings and without DBTokenStore (in memory one used instead, the default)
- 08:19 elukey: restart oozie/hive daemons on an-coord1001 for openjdk upgrades
- 06:41 elukey: roll restart druid daemons for openjdk upgrades
- 06:39 elukey: stop timers on an-coord1001 to facilitate daemon restarts (hive/oozie)
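The 08:22 entry about the DBTokenStore corresponds to the upstream Hive delegation-token store property; a sketch of the hive-site.xml change (the in-memory store is the Hive default):
  <!-- hive-site.xml -->
  <property>
    <name>hive.cluster.delegation.token.store.class</name>
    <value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
  </property>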
2020-03-02
[edit]- 19:58 joal: Remove faulty _REFINED file at /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED
- 15:38 elukey: apply new settings to all stat/notebooks
- 15:31 elukey: setting new user.slice global memory/cpu settings on notebook1003
- 15:25 elukey: setting new user-slice global memory/cpu settings on stat1007
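The user.slice memory/cpu settings above are systemd resource-control directives; a minimal sketch of a drop-in, with path and values chosen purely for illustration:
  # /etc/systemd/system/user.slice.d/resource-limits.conf (path and values assumed)
  [Slice]
  MemoryLimit=220G
  CPUQuota=1600%
  # then reload and verify
  sudo systemctl daemon-reload
  systemctl show user.slice -p MemoryLimit -p CPUQuotaPerSecUSec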
2020-02-28
[edit]- 19:10 milimetric: deployed 0.0.116 and restarted webrequest load bundle at 2020-02-28T14
- 14:49 joal: Drop test keyspaces in cassandra cluster
2020-02-27
[edit]- 21:16 milimetric: tried to deploy AQS but it failed with the same integration test on mediarequests, sending email
2020-02-26
[edit]- 15:06 ottomata: dropped and re-added backfilled partitions on event.CentralNoticeImpression table to propagate the schema alter on the main table - T244771
- 09:50 joal: Force delete old api/cirrus events from HDFS trash to free some space
2020-02-24
[edit]- 18:20 elukey: move report updater jobs from stat1007 to an-launcher1001
2020-02-22
[edit]- 14:21 elukey: restart hadoop-yarn-nodemanager on analytics1044 - broken disk, apply hiera overrides to exclude it
- 14:11 elukey: restart hadoop-yarn-nodemanager on analytics1073 - process died, logs saved in /home/elukey
2020-02-21
[edit]- 16:04 ottomata: altered event.CentralNoticeImpression table column event.campaignStatuses to type string, will backfill data - T244771
- 11:49 elukey: restart varnishkafka on various cp30xx nodes
- 11:41 elukey: restart varnishkafka on cp3057 (stuck in timeouts to kafka, analytics alarms raised)
- 08:19 fdans: deploying refinery
- 00:11 joal: Rerun failed wikidata-json_entity-weekly-coord instances after having created the missing hive table
2020-02-20
[edit]- 16:57 fdans: refinery source jars updated
- 16:39 fdans: deploying refinery source 0.0.114
- 15:16 fdans: deploying AQS
2020-02-19
[edit]- 16:58 ottomata: Deployed refinery using scap, then deployed onto hdfs
2020-02-17
[edit]- 18:29 elukey: reboot turnilo and superset's hosts for kernel upgrades
- 18:25 elukey: restart kafka on kafka-jumbo1001 to pick up new openjdk updates
- 18:22 elukey: restart cassandra on aqs1004 to pick up new openjdk updates
- 17:59 elukey: restart druid daemons on druid1003 to pick up new openjdk updates
- 17:58 elukey: restart cassandra on aqs1004 to pick up new openjdk updates
- 17:56 elukey: restart hadoop daemons on analytics1042 to pick up new openjdk updates
2020-02-15
[edit]- 12:07 elukey: re-run failed pageview druid hour
- 12:05 elukey: re-run failed virtualpageview hours
2020-02-12
[edit]- 14:33 elukey: restart hue on analytics-tool1001
- 13:36 joal: Kill-restart webrequest bundle to see if it mitigates the error
2020-02-10
[edit]- 15:26 elukey: kill application_1576512674871_246621 (consuming too much memory)
- 14:31 elukey: kill application_1576512674871_246419 (eating a ton of ram on the cluster)
2020-02-08
[edit]- 09:35 elukey: created /wmf/data/raw/wikidata/dumps/all_ttl on hdfs
- 09:35 elukey: created /wmf/data/raw/wikidata/dumps/all_json on hdfs
2020-02-05
[edit]- 21:14 joal: Kill data_quality_stats-hourly-bundle and data_quality_stats-daily-bundle
- 21:11 joal: Kill-restart mediawiki-history-dumps-coord, drop existing data, and restart at 2019-11
- 21:06 joal: Kill-restart mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord
- 20:51 joal: Deploy refinery using scap
- 20:29 joal: Refinery-source released in archiva by jenkins
- 20:20 joal: Deploy hdfs-tools 0.0.5 using scap
2020-02-03
[edit]- 11:20 elukey: restart oozie on an-coord1001
- 10:11 elukey: enable all timers on an-coord1001 after spark encryption/auth settings
- 09:32 elukey: roll restart yarn node managers again to pick up spark encryption/authentication settings
- 08:34 elukey: stop timers on an-coord1001 to drain the cluster and ease the deploy of spark encryption settings
- 07:58 elukey: roll restart hadoop yarn node managers to pick up new libcrypto.so link (shouldn't be necessary but just in case)
- 07:24 elukey: create /usr/lib/x86_64-linux-gnu/libcrypto.so on all the analytics nodes via puppet
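The spark encryption/authentication settings rolled out above are standard Spark 2.x properties set in spark-defaults.conf; a sketch (the values shown are assumptions about this deployment, not a record of it):
  # /etc/spark2/conf/spark-defaults.conf
  spark.authenticate              true
  spark.network.crypto.enabled    true
  spark.io.encryption.enabled     true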
2020-01-27
[edit]- 05:38 elukey: re-run webrequest text 2020-01-26T20/21 with higher dataloss thresholds (false positives)
- 02:49 elukey: re-run refine eventlogging manually to clear out refine failed events
2020-01-26
[edit]- 17:58 elukey: re-run failed refine job for MobileWebUIActionsTracking 2020-01-26T12
- 17:32 elukey: restart varnishkafka on cp3056/cp3064 due to network issues on the hosts
2020-01-23
[edit]- 17:48 milimetric: launching a sqoop for imagelinks (will be slow because of a tuning session)
2020-01-20
[edit]- 12:19 elukey: restart zookeeper on an-conf100X to pick up openjdk-11 updates
2020-01-18
[edit]- 10:06 elukey: re-run all failed entropy jobs via Hue (StopWatch issue)
2020-01-16
[edit]- 20:52 mforns: deployed refinery accompanying source v0.0.112
- 17:00 mforns: deployed refinery-source v0.0.112
- 15:17 elukey: upgrade superset to 0.35.2
- 15:14 elukey: stop superset as prep step for upgrade
2020-01-15
[edit]- 10:44 elukey: remove flume-ng and spark-python/core packages from an-coord1001,analytics1030,analytics-tool1001,analytics1039 - T242754
- 10:39 elukey: remove flume-ng from all stat/notebooks - T242754
- 10:37 elukey: remove spark-core flume-ng from all the hadoop workers - T242754
- 08:44 elukey: move aqs to the new rsyslog-logstash pipeline
2020-01-14
[edit]- 20:12 milimetric: deployed aqs with new service-runner version 2.7.3
2020-01-13
[edit]- 21:45 milimetric: webrequest restarted
- 21:32 milimetric: killing webrequest bundle for restart
- 15:00 joal: Deploy hdfs-tools 0.0.3 using scap
- 14:24 joal: Releasing hdfs-tools 0.0.3 to archiva
- 12:54 elukey: restart hue to re-apply user hive limits (again)
2020-01-10
[edit]- 14:30 elukey: restart oozie with new settings to instruct it to pick up spark-defaults.conf settings from /etc/spark2/conf
- 07:38 elukey: re-run virtualpageviews-druid-daily 09/01/2020 via Hue
- 07:37 elukey: systemctl restart drop-el-unsanitized-events on an-coord1001
2020-01-09
[edit]- 11:17 moritzm: installing cyrus-sasl security updates
- 11:10 elukey: remove old accounts (user: absent) from Superset
- 10:30 elukey: revert hue's hive query limit and restart hue - T242306
- 07:45 elukey: re-run failed data-quality-stats-event.navigationtiming-useragent_entropy-hourly-coord 2020/01/09T00
- 07:33 elukey: kill test_elukey_webrequest_sampled_128 from druid
- 07:30 elukey: restart turnilo after updating the webrequest_sampled_128's config
2020-01-08
[edit]- 20:44 joal: Restart webrequest-load-bundle to update queue to production
- 20:17 joal: rerun edit-hourly-wf-2019-12 after having updated the underlying table <facepalm />
- 20:06 joal: Prepare and start learning-features-actor-hourly-coord
- 19:56 joal: kill wikidata-articleplaceholder_metrics-coord as it is buggy
- 19:56 joal: Kill-restart edit-hourly-coord and edit-hourly-druid-coord
- 19:48 joal: Kill-restart wikidata-articleplaceholder_metrics-coord
- 19:44 joal: Kill-restart mediawiki-history-load-coord, mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-metrics-coord, mediawiki-history-reduced-coord, mediawiki-history-dumps-coord
- 19:42 joal: Kill-restart mediawiki-history-load-coord,
- 19:29 joal: Kill-restart webrequest-druid-daily-coord and webrequest-druid-hourly-coord after deploy
- 19:16 joal: Deploy refinery on HDFS
- 19:04 joal: Deploy refinery using scap
- 18:30 joal: Releasing refinery-0.0.110 to archiva using Jenkins
- 18:11 joal: AQS deployed with new druid datasource (2019-12)
- 17:52 joal: Rerun webrequest-load-wf-text-2020-1-8-15 with updated thresholds after frontend issue
2020-01-07
[edit]- 17:54 elukey: apt-get remove python3.5 on stat1005
- 15:16 elukey: re-enable timers on an-coord1001 after hive restart
- 15:03 elukey: restart hive (server+metastore) on an-coord1001 to apply delegation token settings
- 14:36 elukey: stop timers on an-coord1001 as prep step to restart hive
- 14:05 elukey: apply max cpu cores usage (via systemd cgroups) on stat/notebook
- 07:59 elukey: restart hue (again) with correct principal settings
- 07:42 elukey: restart Hue after applying a new kerberos setting (hue_principal, was not specified before)
2020-01-06
[edit]- 16:45 joal: Manually sqoop missing tables (content, content_models, slot_roles, slots, wbc_entity_usage)
2020-01-02
[edit]- 18:32 elukey: restart hue with new hive query limits