Analytics/Server Admin Log/Archive/2018
2018-12-22
- 18:47 elukey: manual cleanup of old log files on an-coord1001 (disk space issues)
2018-12-21
- 22:55 mforns: Restarted Turnilo to clear a deleted test datasource
- 18:44 mforns: Restarted Turnilo to clear a deleted test datasource
2018-12-19
- 00:09 mforns: restarted Turnilo to clear deleted datasource
2018-12-18
- 23:50 mforns: restarted Turnilo to clear deleted datasource
- 20:34 ottomata: bounced eventlogging-processor to pick up change to send invalid rawEvents as json string
- 19:36 ottomata: re-running refine_eventlogging_backfill again for days in december - T211833
- 17:37 mforns: restarted Turnilo to clear deleted datasource
- 15:34 mforns: restarted Turnilo to clear deleted datasource
- 10:19 mforns: restarted Turnilo to clear deleted datasource
2018-12-17
- 23:52 mforns: restarted Turnilo to clear deleted datasource
- 22:50 mforns: restarted Turnilo to clear deleted datasource
- 22:31 mforns: restarted Turnilo to clear deleted datasource
- 15:23 ottomata: re-running refine_eventlogging_analytics with --ignore_done_flag (backfilling didn't complete properly on friday) - T211833
2018-12-14
- 20:45 ottomata: re-refining all hive EventLogging tables since 2018-11-29T17:00:00. - T211833
- 20:35 ottomata: removing EventLogging Hive _REFINED flag files since 2018-11-29T17:00:00 to allow for re-refinement of data - 2018-11-29T17:00:00
- 19:50 ottomata: starting refinery release deploy process for refinery 0.0.82 to fix T211833
2018-12-13
- 21:11 ottomata: superset is back up at version 0.26.3
- 20:57 ottomata: stopped superset on analytics-tool1003 for revert to previous version (luca will revert the db backup)
- 15:25 mforns: restarted turnilo to clean a deleted test datasource
2018-12-12
- 22:49 mforns: restarted turnilo to clear deleted test datasource
- 20:07 mforns: restarted turnilo to clear deleted test datasource
- 17:10 mforns: restarted turnilo to clear deleted test datasource
2018-12-11
- 16:03 mforns: restarted Turnilo to clear deleted datasource
- 15:27 mforns: restarted Turnilo to clear deleted datasource
- 14:58 joal: Restart clickstream job after having repaired hive mediawiki-tables partitions
2018-12-10
- 19:37 joal: Manually deleting old druid-public snapshots that were not following datasource naming convention (- instead of _)
- 14:51 milimetric: trying the labsdb/analytics-store combination sqoop, live logs in /home/milimetric/sqoop-[private-]log.log on stat1004
2018-12-07
- 08:10 joal: manually create /wmf/data/raw/mediawiki/tables/change_tag/snapshot=2018-11/_SUCCESS on hdfs to unlock mw-history-load and therefore mw-history-reduced
2018-12-06
- 15:08 elukey: turnilo migrated to nodejs 10
2018-12-05
- 14:53 elukey: restart hdfs namenodes and yarn rm to update rack awareness config (prep for new nodes)
- 11:58 fdans: backfilling in progress, killing uniques coordinators within bundle, will restart bundle on Jan 1st
- 11:34 fdans: backfill test successful. Starting job to backfill family uniques since mar 2017
- 10:03 fdans: backfilling test for unique project families - start_time=2016-01-01T00:00Z stop_time=2016-02-01T00:00Z
- 09:13 elukey: matomo read only + upgrade to matomo 3.7.0 on matomo1001
- 07:43 elukey: restart middlemanager/broker/historical on druid-public to pick up new log4j settings
2018-12-04
- 18:26 ottomata: reenabled refinement of mediawiki_revision_score
- 17:50 joal: Deploying aqs using scap for offset and underestimate values in unique-devices endpoints
- 17:12 elukey: cleanup logs on /var/log/druid on druid100[1-3] after change in log4j settings
- 15:25 elukey: rolling restart of broker/historical/middlemanager on druid100[1-3] to pick up new logging settings
- 15:01 joal: Update test values for uniques in cassandra before deploy
- 14:56 elukey: restart druid broker and historical on druid1001
- 12:16 joal: Drop cassandra test keyspace "local_group_default_T_unique_devices_TEST"
- 10:55 fdans: deploying AQS to expose offset and underestimate numbers on unique devices
2018-12-03
- 20:05 ottomata: dropping and recreating hive event.mediawiki_revision_score table and data - T210465
- 18:11 mforns: rerun webrequest upload load job for 2018-12-01T14:00
2018-12-01
- 08:50 fdans: bundle restarted successfully
- 08:39 fdans: killing current cassandra bundle
2018-11-30
- 12:45 joal: Update hive wmf_raw mediawiki schemas (namespace bigint -> int)
2018-11-29
- 18:33 mforns: Finished refinery deployment using scap and refinery-deploy-to-hdfs
- 17:41 mforns: Starting refinery deployment using scap and refinery-deploy-to-hdfs
- 17:37 mforns: Deployed refinery-source using jenkins
2018-11-26
- 15:47 ottomata: moved old raw revision-score data to hdfs in /user/otto/revision_score_old_schema_raw - T210013
- 15:41 ottomata: stopped producing revision-score events with old schema; merged and deployed new schema; petr to deploy change to produce events with new schema soon. https://phabricator.wikimedia.org/T210013
- 15:27 fdans: monthly and daily jobs for uniques killed, replaced with backfilling jobs until Dec 1st
2018-11-22
- 13:42 elukey: allow the research user to create/alter/etc.. tables on staging@db1108
2018-11-21
- 19:49 milimetric: deploying AQS
- 13:06 fdans: launching backfilling jobs for daily and monthly uniques from beginning of time until Nov 20
- 13:05 fdans: test backfill on 13 Nov daily uniques successful
- 12:54 fdans: testing backfill of daily uniques in production for 2018-11-13
2018-11-20
- 14:02 elukey: restart hive-server2 to pick up new settings - T209536
- 11:44 elukey: re-run pageview-hourly-wf-2018-11-20-9
2018-11-19
- 13:59 joal: failing deployment on aqs to include a new patch
- 13:41 joal: Deploying aqs using scap
- 13:27 fdans: deploying aqs to add new fields to uniques dataset (T167539)
2018-11-18
- 08:44 elukey: re-run webrequest-load-wf-text-2018-11-17-23 via Hue
- 08:37 elukey: restart yarn on analytics1039 - not clear why the process failed (nothing in the logs, no other disks failed)
2018-11-15
- 14:51 fdans: testing load of new uniques fields in test keyspace in cassandra
- 14:07 elukey: re-run mediacounts-load-wf-2018-11-15-8 - died due to issues on an1039 (happened this morning, broken disk)
2018-11-12
- 19:30 ottomata: running oozie-setup sharelib create and then spark2_oozie_sharelib_install
- 15:40 fdans: Restarting per project family unique generation jobs (daily and monthly)
- 13:18 joal: Suspend discovery 0060527-180705103628398-oozie-oozi-C coordinator for it not to block upgrade
2018-11-05
- 10:20 joal: Create hive tables wmf.webrequest_subset and wmf.webrequest_subset_tags
- 10:02 joal: Start mediawiki-history-wikitext job
- 09:58 joal: create wmf.mediawiki_wikitext_history table
- 09:46 joal: Alter wmf.pageview_whitelist renaming insertion_ts field to insertion_dt for convention
- 09:43 joal: restart mediawiki-load oozie bundle to pick new deploy
- 09:39 joal: Restart mediawiki-history-load oozie job to pick new deploy
- 09:37 joal: Create table wmf_raw.mediawiki_change_tag
- 09:24 joal: deploying refinery onto HDFS
- 09:04 joal: Deploy refinery from scap
- 08:55 joal: Refinery-source released on archiva
2018-10-30
- 16:55 mforns: Finished AQS deployment using scap
- 16:45 mforns: Starting AQS deployment using scap
- 15:34 ottomata: kafka topics --alter --topic eventlogging_VirtualPageView --partitions 12
2018-10-29
- 22:55 ottomata: groceryheist killed a long running hive query that is now allowing backlogged production yarn jobs to finally execute
- 16:37 ottomata: reassigning eventlogging_ReadingDepth partition 0 from 1002,1004,1006 to 1003,1001,1005 to move preferred leadership from 1002 to 1003
- 14:27 ottomata: ran kafka-preferred-replica-election on kafka jumbo-eqiad cluster (this successfully rebalanced webrequest_text partition leadership) T207768
- 10:23 joal: Kill yarn application application_1540747790951_1429 to prevent more cluster errors (eating too many resources)
- 08:56 elukey: bounce yarn resource managers to pick up new zookeeper session timeout settings
2018-10-28
- 17:30 elukey: restart yarn resource manager on an-master1002 to force failover to an-master1001
2018-10-26
- 11:49 joal: Rerun failed oozie jobs (pageview and projectview)
- 06:18 elukey: add AAAA DNS records for aqs and matomo1001
- 05:55 elukey: reportupdater hadoop migrated to stat1007
2018-10-25
- 21:06 ottomata: bouncing eventlogging-processor client side* to pick up mysql whitelist change for ContentTranslationAbuseFilter (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/)
- 18:14 joal: Manually resume the bunch of suspended jobs (mostly from ebernhardson and chelsyx - our apologies for not noticing earlier)
- 18:13 joal: Manually copy /etc/hive/conf/hive-site.xml to hdfs:///user/hive and set permissions to 644 to allow all users to run oozie jobs
- 15:36 elukey: shutdown aqs1006 to replace one broken disk
- 14:28 elukey: upgrade druid on druid100[4-6] to Druid 0.12.3
- 14:24 elukey: added AAAA DNS records to all the druid nodes
- 10:36 joal: Resuming oozie webrequest and pageview druid hourly indexation jobs
- 10:35 elukey: upgraded Druid on druid100[1-3] to 0.12.3-1
- 09:16 elukey: upgrade turnilo to 1.8.1
- 08:56 elukey: restart hive-server on an-coord1001 to pick up new prometheus settings
- 08:10 joal: Suspend webrequest-druid-hourly and pageview-druid-hourly oozie jobs
- 07:52 joal: Manually add za.wikimedia to pageview-whitelist (patch merged: https://gerrit.wikimedia.org/r/469557)
2018-10-23
- 16:25 ottomata: altering topic eventlogging_ReadingDepth to increase partitions from 1 to 12
- 06:42 elukey: restart yarn and hdfs daemon on analytics1068 to pick up correct config (the host was down since before we swapped the Hadoop masters due to hw failures)
2018-10-22
- 17:24 elukey: upgraded camus jar version in an-coord1001's crontab (via puppet)
- 17:21 elukey: deploy refinery to hdfs (via stat1005)
- 17:12 elukey: deploy refinery (new version of camus)
- 15:09 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs
- 14:51 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
- 14:50 mforns: Finished deployment of refinery-source using jenkins
- 14:24 mforns: Starting deployment of refinery-source using jenkins
2018-10-16
- 12:32 joal: rerun pageview-hourly-wf-2018-10-15-17
2018-10-15
- 19:45 mforns: Finished refinery deployment with scap and refinery-deploy-to-hdfs
- 19:10 mforns: Started refinery deployment with scap and refinery-deploy-to-hdfs
- 19:09 mforns: Finished refinery-source deployment
- 18:42 mforns: Started refinery-source deployment
- 15:20 mforns: Finished refinery deployment with scap and refinery-deploy-to-hdfs
- 14:52 mforns: Started refinery deployment with scap
- 14:47 mforns: Finished refinery-source deployment
- 14:19 mforns: Started refinery-source deployment
- 14:05 elukey: swapped cobalt's ip with gerrit.wikimedia.org's one in analytics-in(4|6) firewall filters on the eqiad routers for https://phabricator.wikimedia.org/T206331#4666622. This should not cause git pulls to fail but let me know in case it does.
2018-10-14
- 09:15 elukey: restart yarn resource manager on an-coord1002 (failover happened due to jvm issues)
- 09:15 elukey: restart apps-session-metrics with spark 2.3.1 oozie libs (modified the coordinator.properties file manually on disk)
2018-10-12
- 07:32 elukey: cleaned up all september files from eventlog1002's srv el archive to free some space (disk alerts)
2018-10-11
- 14:20 elukey: reboot eventlog1002 for kernel upgrades
2018-10-10
- 19:27 joal: Restart webrequest-load oozie bundle
- 18:23 joal: kill Webrequest-load bundle
- 18:04 joal: Kill webrequest-load-coord-upload
- 07:23 elukey: add ipv6 mapped addresses (and DNS PTRs) to analytics-tools*
- 07:23 joal: Full restart of browser-general oozie job
- 07:19 joal: patch mediacount-archive job in prod
- 07:16 joal: Full restart of mediacount-archive oozie job
- 05:54 elukey: re-run failed mediacounts and browser-general coordinators with hive-site -> hdfs://analytics-hadoop/user/hive/hive-site.xml
2018-10-09
- 18:24 ottomata: adding Accept header to all varnishkafka generated webrequest logs
- 15:10 joal: restart Mediawiki-history-reduced
- 15:08 joal: restart wikidata-coeditors oozie job
- 15:08 joal: restart wikidata-specialentites oozie job
- 15:00 joal: restart wikidata-article-placeholder oozie job
- 14:57 joal: restart mediawiki-history denormalize oozie job
- 14:56 joal: Restart check_denormalize oozie job
- 14:53 joal: Restart clickstream oozie job to pick new spark-lib
- 13:56 ottomata: bouncing oozie server on an-coord1001
- 13:46 joal: Restarting oozie-api job
- 13:36 joal: fully restart projectview_geo oozie job
- 13:26 joal: Full restart of aqs oozie job
- 13:25 joal: full restart of projectview_hourly
- 13:14 joal: rerun failed aqs-hourly jobs
- 12:48 elukey: re-run all the failed projectview-hourly-coord and aqs-hourly-coord workflows (restarting them via hue)
- 12:47 elukey: re-run apis-wf-2018-10-9-8
- 10:01 joal: Restart failed oozie jobs (webrequest, virtual-pageviews, mwh-reduced)
- 07:14 elukey: stopped all crons on analytics1003 as prep step for migration to an-coord1001
2018-10-08
- 16:28 elukey: restart eventlogging on eventlog1002 for python security upgrades
- 10:26 elukey: swapped db settings from analytics1003 to an-coord1001 on both Druid clusters (restarted coordinators and overlords)
- 07:35 joal: Manually run download-project-namespace-map with proxy
2018-10-06
- 18:10 elukey: restart Yarn Resource Manager on an-master1002 to force an-master1001 to take the active role back (failed over due to a zk conn issue)
2018-10-05
- 10:32 elukey: piwik/matomo out of maintenance
- 10:17 elukey: set piwik/matomo in maintenance mode on matomo1001
2018-10-04
- 20:33 mforns: Finished deployment of refinery
- 19:52 mforns: Started deployment of refinery
- 19:50 mforns: Finished deployment of refinery-source
- 19:22 mforns: Started deployment of refinery-source
- 17:20 elukey: bounce druid-brokers on druid100[4-6] after network maintenance
2018-10-01
- 12:56 fdans: reverting to last version of wikistats
2018-09-27
- 06:44 elukey: rolling restart of Druid coordinators and historicals on the Druid public cluster to pick up new Hadoop masters (one at a time, very gently)
2018-09-26
- 20:39 elukey: rolling restart of all the druid historicals on Druid private/analytics
- 20:00 ottomata: rolling restart of druid coordinators to hopefully pick up hadoop master config change
- 17:49 joal: Deploy AQS from scap
- 08:22 elukey: start mysql consumers on eventlog1002 after maintenance
- 07:51 elukey: stop mysql consumers on eventlog1002 as prep step for db maintenance
2018-09-25
- 20:21 joal: Webrequest warnings for upload-2018-09-25-13 were all false positives
- 17:36 ottomata: stopping refine jobs and deploying refinery source 0.0.75 - T203804
- 12:37 joal: Rerun webrequest-load-wf-text-2018-9-25-6 and webrequest-load-wf-text-2018-9-25-7 after SLA failure due to hadoop master swaps
- 11:55 joal: Rerun webrequest-load-wf-upload-2018-9-25-6 after failed SLA during hadoop master swap
- 11:53 joal: rerun as you prefer dcausse :)
- 08:02 joal: Killing discovery transfer job to drain cluster before master replacement (application_1536592725821_38136)
- 06:24 elukey: stop camus crons on an1003 and report updater on stat1005 as prep step for cluster shutdown
2018-09-20
- 16:04 joal: webrequest-load-check_sequence_statistics-wf-text-2018-9-19-20 have been checked as false-positive
2018-09-15
- 12:22 joal: Restart webrequest-druid-[hourly|daily] coordinators
- 12:20 joal: Kill wikidata-wdqs coordinator
- 12:11 joal: Killing and restarting webrequest-load-bundle
- 12:00 joal: Deploying refinery onto hadoop :)
- 11:39 joal: Deploying refinery with scap
2018-09-12
- 17:34 ottomata: deploying new version of refinery-source, and then refinery for properties based RefineMonitor job - https://phabricator.wikimedia.org/T203804
- 13:11 ottomata: otto@deploy1001 Started deploy [eventlogging/analytics@5c6fab6]: Support loading plugins in eventlogging-processor - T203596
- 06:21 elukey: re-run webrequest-load-wf-text-2018-9-12-4, failed due to sql exceptions/timeouts to the database
2018-09-10
- 16:26 ottomata: restarting eventlogging-processors to pick up blacklist of WebClientError schema for MySQL - T203814
- 12:49 elukey: disable camus as prep step for analytics100[1-3] reboots
- 07:54 joal: Manually restarting mediawiki-reduced oozie with manual addition of missing parameter
2018-09-07
- 18:18 joal: Manually download namespaces for 2018-08
- 17:32 joal: Manually rerun download-project-namespace-map on analytics1003 after cron's failure
2018-09-06
- 13:03 fdans: restarted virtualpageview_hourly coordinator
2018-09-05
- 18:18 ottomata: restarted eventlogging processors blacklisting CentralNoticeImpression - T203592
- 16:56 ottomata: restarting eventlogging processors to blacklist CitationUsage - T191086
- 14:42 elukey: deploying refinery (pageview whitelist and cron script change)
- 13:40 ottomata: reimaging thorium to debian stretch (this will cause an announced {stats,analytics}.wm.org downtime!) - T192641
- 13:21 fdans: restarting webrequest load bundle, start time 11:00Z
- 09:12 elukey: re-run webrequest-druid-hourly-wf-2018-9-5-7 - failed due to rebooting druid1001
- 07:02 elukey: restart oozie on analytics1003 to pick up new smtp settings
- 06:37 elukey: re-run webrequest-load-wf-misc-2018-9-5-2 and webrequest-load-wf-upload-2018-9-4-19 via Hue
- 06:35 elukey: upload new pageview whitelist to hdfs
2018-09-04
- 19:05 joal: Restart cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2018-9-4-14
- 16:37 fdans: restarting webrequest-load bundle
- 16:07 fdans: beginning refinery deployment
- 14:28 fdans: deployed refinery source using jenkins
2018-09-03
- 08:07 joal: Delete /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2018-05 folder and relaunch mediawiki-geoeditors-load-wf-2018-08
- 06:05 elukey: re-run virtualpageview-hourly-wf-2018-9-2-1 via Hue (failed oozie job to inspect: 0082016-180705103628398-oozie-oozi-W)
2018-08-31
- 14:31 elukey: re-run webrequest-load-wf-upload-2018-8-31-11, failed due to hadoop workers reboots
- 10:05 elukey: re-run webrequest-load-wf-upload-2018-8-31-7, failed due to hadoop workers reboots
- 09:15 elukey: re-run webrequest-load-wf-upload-2018-8-31-[7,8], failed due to hadoop workers reboots
- 07:26 elukey: re-run webrequest-load-wf-text-2018-8-31-4, failed due to hadoop workers reboots
- 06:20 elukey: re-run webrequest-load-wf-text-2018-8-31-4, failed for hadoop workers reboots
2018-08-30
- 15:59 elukey: rerun of pageview-druid-hourly-wf-2018-8-30-13, hadoop worker reboots in progress
- 15:23 elukey: re-run webrequest-load-wf-upload-2018-8-30-13, failed due to hadoop worker reboots
- 14:49 elukey: re-run webrequest-load-wf-text-2018-8-30-12, failed due to worker nodes reboots
2018-08-29
- 15:27 joal: Deploy AQS with scap
- 13:59 ottomata: upgrading spark2 package with pyarrow dependency and default pyspark to python3
- 11:39 joal: Restart mediawiki-history-reduced oozie job after deploy
- 10:41 elukey: nuked /srv/deployment/analytics/refinery on stat1005 after errors with archiva/git-fat (stat1005 is the canary)
- 10:34 joal: Deploying refinery onto HDFS
- 08:53 joal: Deploying refinery using scap
- 08:30 joal: Deploying refinery using jenkins
2018-08-28
- 14:19 joal: Restart Workflow pageview-druid-hourly-wf-2018-8-28-11
- 13:00 joal: Restart mediawiki-history and mediawiki-history-druid jobs
- 10:59 joal: deploying refinery onto HDFS
- 10:44 joal: Deploying refinery from scap
- 10:22 joal: Refinery-source v0.0.71 deployed onto archiva
- 08:14 joal: Restart virtualpageview-hourly-wf-2018-8-27-21
2018-08-20
- 21:28 fdans: restarting virtualpageview-hourly-coord
- 21:21 fdans: refinery deployment succeeded
- 20:55 fdans: deploying analytics refinery
- 18:18 ottomata: restarting eventlogging client side processes using librdkafka1 0.11.x - https://phabricator.wikimedia.org/T200769
2018-08-14
- 23:07 fdans: starting deployment of refinery via scap
- 19:35 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs
- 18:57 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
- 18:44 mforns: Finished deployment of refinery using scap and refinery-deploy-to-hdfs
- 18:03 mforns: Starting deployment of refinery using scap and refinery-deploy-to-hdfs
- 17:29 mforns: Finished deployment of refinery-source using jenkins
- 15:51 ottomata: removed /srv/geowiki from stat1006
- 14:39 mforns: Starting deployment of refinery-source using jenkins
2018-08-13
- 14:59 ottomata: deploying refinery-0.0.69 and refinery changes for T198908
2018-08-10
- 14:52 ottomata: restarting eventlogging-consumer@mysql-eventbus consuming from kafka jumbo-eqiad - T201420
2018-08-09
- 17:49 mforns: finished refinery deploy using scap and refinery-deploy-to-hdfs
- 17:24 mforns: starting refinery deploy using scap
- 17:23 mforns: finished refinery-source deploy using jenkins
- 16:46 mforns: starting refinery-source deploy using jenkins
2018-08-08
- 21:29 joal: Webrequest data-loss warnings for upload and text for hour 2018-08-08-18 contained only false positives (possibly related to a network glitch?)
2018-08-07
- 13:24 joal: Update AQS druid datasource to 2018-07 snapshot
2018-08-06
- 19:11 ottomata: upgrading from spark 2.3.0 -> spark 2.3.1 everywhere
- 12:21 joal: Warning in webrequest-upload-2018-8-1-13 contained only false positives
2018-08-02
- 17:33 milimetric: deployed refinery, relaunching geoeditors sqoop
2018-08-01
- 16:23 mforns: deploying refinery using scap
- 10:06 elukey: restart all the yarn nodemanagers after minor max memory allocation change
- 09:19 elukey: restart webrequest-load-wf-text-2018-8-1-7 (died due to yarn restarts)
- 06:59 elukey: restart eventlogging on eventlog1002 to pick up new logging settings
2018-07-28
- 17:29 elukey: restart eventlogging on eventlog1002 after tons of disconnects (still not clear what happened)
2018-07-27
- 15:18 joal: Deploying AQS with scap
- 11:42 joal: Restart mediawiki-history-denormalize oozie job after deploy
2018-07-26
- 17:09 joal: Restart mediawiki-history-reduced job after deploy
- 14:03 joal: Restart webrequest-bundle load job to pick new pageview definition
- 13:59 joal: Start wikidata-coeditors job
- 07:57 joal: Deploying refinery with scap - 2nd try
2018-07-25
- 15:42 joal: Deploying refinery onto HDFS
- 14:38 joal: Release refinery v0.0.67 to archiva
2018-07-24
- 11:46 joal: Checked that oozie webrequest upload warning for hour 2018-07-24-07 contains only false positives
2018-07-18
- 08:48 elukey: re-run hour 7 of webrequest upload/text via Hue (failed due to a hadoop node restart)
2018-07-10
- 10:43 elukey: restart map reduce history server on an1001 as an attempt to see if it is related to yarn.w.o unresponsiveness
- 10:03 elukey: bounce yarn RM on an100[12], some socket errors after the ip6 interface rollout
- 08:20 joal: Update AQS druid backend datasource to 2018-06
2018-07-05
- 10:36 elukey: restart oozie on analytics1003 - connection timeouts from thorium after mariadb maintenance
- 10:34 elukey: restart hive metastore on an1003, errors after mariadb maintenance this morning
- 07:44 elukey: all jobs re-enabled
- 06:26 elukey: stop camus to allow mariadb restart on analytics1003
2018-07-02
- 14:56 elukey: resume cassandra bundle via hue
- 13:27 elukey: suspend cassandra bundle via Hue to ease the reimage of aqs1004
- 09:12 joal: Rerun mediawiki-geoeditors-load-wf-2018-06 after having fixed the wmf_raw.mediawiki_private_cu_changes table issue
- 07:12 joal: Restart cassandra bundle
2018-06-28
- 14:46 elukey: upgrade piwik 3.2.1 to matomo (new name/package) 3.5.1
- 11:27 joal: Change mediawiki-reduced table format to be parquet and restart mediawiki-reduced oozie job
- 11:19 joal: Restart druid uniques daily-monthly-aggregated indexation jobs
- 11:19 joal: Start backfilling job cassandra pageviews-top-countries ceiled-values
- 10:20 joal: Deploying refinery to HDFS
- 10:09 joal: Deploying refinery using scap
- 09:03 joal: deploying AQS pageviews-bycountry ceiled value glue code
- 07:41 fdans: testing load of 2 months of per country pageviews with the new ceiled value
- 06:10 elukey: move /srv/kafka to a dedicated 60G partition on deployment-jumbo hosts in deployment-prep
2018-06-27
- 21:51 elukey: piwik maintenance completed
- 13:08 elukey: piwik upgraded to 3.2.1 on bohrium + started the db migration procedure (will last 2/3h probably)
- 12:57 elukey: set Piwik in maintenance mode as prep step for backup + upgrade
2018-06-20
- 19:54 ottomata: removed Kafka MirrorMaker from kafka10(12|13|14)
2018-06-18
- 11:57 joal: Restart oozie webrequest refine jobs
- 11:19 joal: Launch oozie webrequest refine jobs for the failing hour 2018-06-14-11
- 10:18 joal: Deployed refinery on hdfs
- 10:18 joal: Deployed refinery with scap
2018-06-15
- 09:00 joal: Deleting corrupted file hdfs://analytics-hadoop/user/joal/wmf/data/raw/webrequest/webrequest_upload/hourly/2018/06/14/11/webrequest_upload.1004.10.1214791.15490650727.1528974000000._COPYING_ to prevent webrequest refine jobs from failing. No data will be lost as the correct file exists.
2018-06-14
- 19:29 joal: try rerunning webrequest-load-wf-upload-2018-6-14-11
- 13:14 elukey: re-run failed webrequest-upload/text jobs (namenodes restarted)
2018-06-11
- 13:56 ottomata: bouncing eventlogging processes to apply kafka event time producing
2018-06-08
- 11:45 joal: Launching manual sqooping of revision and archive table to recover from failure
2018-06-01
- 08:37 joal: Restart every druid loading oozie job (except mediawiki reduced) to pick new configuration
- 08:33 joal: Restart mediawiki-history-denormalize oozie job after deploy
- 08:24 joal: Deploy refinery on HDFS
- 08:08 joal: Deploying refinery using scap
- 07:53 joal: Releasing refinery-source v0.0.65 to archiva
- 07:05 joal: Rerun virtualpageview-druid-monthly-wf-2018-5
2018-05-31
- 17:01 ottomata: dropping and deleting MobileWikiAppiOS* tables and data per request from chelsyx
- 10:51 elukey: stopped Pivot on thorium
- 07:27 joal: Restart webrequest-load-bundle with default oozie_launcher_memory value (should be 2048 set by workflows)
- 05:33 elukey: re-run failed webrequest-load upload|misc jobs via Hue
- 01:02 ottomata: bouncing main-eqiad -> jumbo-eqiad mirror maker
2018-05-30
- 17:49 joal: Rerun webrequest-load-wf-misc-2018-5-30-16
- 13:15 elukey: re-run webrequest-load-wf-upload-2018-5-30-11 - died after worker node reboots
- 06:14 elukey: re-run failed webrequest-load jobs
- 06:11 elukey: temporary point Turnilo to druid1002 to allow druid1001's reimage
- 05:50 elukey: restart mirror maker on kafka10[12-23] - failures to consume after rebalance
2018-05-29
- 17:02 elukey: re-run webrequest-load-text 29th May 2018 12:00:00
- 15:03 joal: rerun webrequest-load-wf-upload-2018-5-29-13
- 10:30 elukey: roll restart of druid-middlemanagers on druid* to pick up the new runtime settings (no more references to hadoop-client-cdh)
- 10:04 elukey: re-run pageview-druid-hourly-wf-2018-5-29-7
- 07:05 elukey: re-run webrequest-load-wf-text-2018-5-29-1
2018-05-28
- 18:51 elukey: rerun webrequest-load-wf-upload-2018-5-28-14
- 18:16 elukey: restart kafka mirror maker on kafka1012->14 - failed after the last round of kafka restarts
- 12:55 elukey: re-run webrequest-load-wf-misc-2018-5-28-10
- 05:50 elukey: re-run webrequest-load-wf-misc-2018-5-27-22, webrequest-load-wf-text-2018-5-28-2, webrequest-load-wf-upload-2018-5-28-3
2018-05-27
- 07:25 joal: Rerun webrequest-load-wf-upload-2018-5-25-23
- 07:25 joal: rerun webrequest-load-wf-misc-2018-5-26-16 and webrequest-load-wf-misc-2018-5-27-0
2018-05-25
- 06:53 elukey: re-run webrequest-load-wf-upload-2018-5-24-23 and webrequest-load-wf-text-2018-5-25-4
2018-05-24
- 17:20 ottomata: dropped and deleted raw and refined eventlogging tables and data for MobileWikiAppiOSUserHistory, MobileWikiAppiOSLoginAction, MobileWikiAppiOSSettingAction, MobileWikiAppiOSReadingLists, MobileWikiAppiOSSessions
- 16:45 joal: Rerun webrequest-druid-daily-wf-2018-5-23 to correct corrupted data
- 08:17 elukey: increase webrequest replication to 2 in druid analytics (via coordinator's UI)
- 08:16 joal: rerun webrequest-load-wf-misc-2018-5-24-6
2018-05-23
- 14:25 ottomata: redirecting pivot -> turnilo.wikimedia.org - https://phabricator.wikimedia.org/T194427
- 07:35 elukey: upgrading the Druid labs cluster to Debian Stretch
- 06:14 elukey: re-run webrequest-load-wf-misc-2018-5-23-2 via Hue
2018-05-22
- 15:38 elukey: re-run webrequest-druid-hourly-wf-2018-5-22-12 - failed due to Druid cluster upgrade in progress
- 14:08 elukey: upgrade druid on druid100[1-3] to 0.11
- 13:37 elukey: killed banner impression data job (application_1523429574968_110796) and removed its related respawn cron on an1003
- 09:43 elukey: upload to HDFS a new pageview whitelist to include fdc.wikimedia - https://gerrit.wikimedia.org/r/434370
- 06:54 elukey: upload Fran's pageview whitelist change to HDFS - related code change: https://gerrit.wikimedia.org/r/#/c/434370/ (also includes mai.wikimedia)
- 06:45 elukey: add nyc.wikimedia to the pageview whitelist on HDFS - related code change: https://gerrit.wikimedia.org/r/434440
2018-05-21
- 21:24 ottomata: granted User:CN=kafka_fundraising_client read permissions for group fundraising* on kafka-jumbo (for kafkatee webrequest consumption: kafka acls --add --allow-principal User:CN=kafka_fundraising_client --consumer --topic '*' --group 'fundraising*')
- 19:16 ottomata: restarted eventlogging file log consumers with new consumer groups beginning at end of topic
- 18:46 ottomata: restarting eventlogging with python-ua-parser 0.8.0
- 16:46 fdans: deploying refinery
- 14:40 fdans: Deploying refinery-source v0.0.64 using Jenkins
- 01:20 ottomata: bouncing main -> jumbo MirrorMaker with increased max.request.size - T189464
2018-05-16
- 17:45 milimetric: refinery deploy is done
- 16:21 milimetric: deploying refinery
2018-05-15
- 16:35 milimetric: finished deploying refinery, cron for dropping old mediawiki snapshots should now be good
- 16:20 milimetric: deploying refinery to fix that partition drop cron
- 16:01 joal: Deploy AQS using scap
- 15:55 ottomata: bouncing main -> analytics MirrorMaker
- 10:38 joal: Kill-Restart mediawiki-history-reduced oozie coordinator to pick up deployed changes
- 09:37 joal: Deploy refinery onto HDFS
- 09:36 joal: Deployed refinery using scap
2018-05-14
- 23:54 ottomata: bouncing main -> jumbo MirrorMaker with larger max.request.size
- 22:39 ottomata: bouncing main-eqiad -> jumbo mirror maker after committing new offset for eqiad.mediawiki.job.RecordLintJob
- 17:27 ottomata: enabling main-eqiad job topics -> jumbo mirroring
- 14:49 milimetric: deployment of refinery done
- 14:07 milimetric: deploying refinery to enable dropping cu_changes data
2018-05-11
- 14:25 elukey: restarted hadoop namenodes/resourcemanagers to apply openjdk security upgrades
2018-05-10
- 14:11 elukey: re-enabled camus after analytics1003's maintenance
- 13:08 elukey: disabled all camus jobs to drain the cluster and allow hive/oozie restarts for jvm upgrades
2018-05-09
- 16:56 ottomata: disabled 0.9 MirrorMaker on kafka102[023], enabled 1.x MirrorMaker on kafka-jumbo*
- 14:41 milimetric: finished deploying refinery with proper geoeditors druid indexing template
- 13:59 ottomata: beginning upgrade of Kafka main-eqiad cluster from 0.9.0.1 to 1.1.0 - T167039
- 13:49 milimetric: deploying refinery again, forgot to index a new metric in the new datasource, sorry
- 13:23 mforns: re-run webrequest-load-wf-misc-2018-5-9-12 via hue
- 13:13 milimetric: deployed refinery
- 12:58 milimetric: deploying very simple change just to rename druid datasource
- 12:48 elukey: re-run webrequest-load-wf-text-2018-5-8-17 via hue
2018-05-08
- 20:35 milimetric: refinery deploy complete
- 20:18 milimetric: deploying geoeditors for real now
- 20:12 milimetric: aborting deployment, will deploy data truncation script too
- 20:08 milimetric: deploying refinery to relaunch geoeditors job
- 17:57 joal: Move recomputed 2018-03 history snapshot in place of old one (T194075)
- 15:38 joal: Try again (last time) to rerun mediawiki-history-druid-wf-2018-04
- 15:06 ottomata: beginning Kafka upgrade of main-codfw: T167039
- 08:01 elukey: removed cassandra-metrics-collector (graphite) from aqs nodes
- 07:42 joal: Rerun mediawiki-history-druid-wf-2018-04 in a non-sync way with mediawiki-reduced
- 06:41 elukey: rolling restart of druid-historicals on druid100[456] due to half of the segments not being available
2018-05-07
- 12:05 joal: Rerun mediawiki-history-reduced-wf-2018-04
- 09:18 elukey: re-run webrequest-load-wf-text-2018-5-7-7 - failed due to reimages
2018-05-04
- 10:11 elukey: d-[123] Druid cluster upgraded to 0.11 in labs (project analytics)
2018-05-03
- 20:29 milimetric: fixed wikimetrics issues, working fine again
- 19:19 milimetric: wikimetrics is partly broken until I can figure out what’s going on
2018-05-02
- 17:33 joal: Rerun webrequest-load-wf-text-2018-5-2-15
- 16:41 joal: Manually silence pageview-whitelist alarm overwriting /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv
- 16:27 joal: 2018-05-02T14 webrequest dataloss warnings have been checked and are false positives
- 16:17 joal: Restart oozie mediawiki-history-denormalize job after deploy
- 16:14 ottomata: bounced eventlogging-consumer@mysql-m4-master-00 after kafka jumbo 1.1.0 upgrade
- 16:05 joal: Restart oozie webrequest bundle after deploy
- 15:20 joal: Deploying refinery to hadoop
- 14:45 joal: Deploying refinery using Scap
- 14:16 joal: Refinery-source version 0.0.63 finally released to Archiva!
- 13:49 ottomata: beginning upgrade of kafka-jumbo brokers from 1.0.0 -> 1.1.0 : T193495
- 13:20 elukey: restart druid broker on druid100[1-3] to enable druid.sql.enable: true
2018-05-01
- 15:33 elukey: restart historical on druid1003 - exceptions in the logs
- 15:22 elukey: restart druid-historical on druid1002 - Caused by: java.lang.IllegalArgumentException: Could not resolve type id 'hdfs' into a subtype of
- 11:44 joal: False positive only in webrequest-load-check_sequence_statistics-wf-upload-2018-5-1-6
- 07:14 joal: Rerun webrequest-druid-daily-wf-2018-4-30
- 06:24 elukey: rolling restart of all middlemanagers on druid100[123] - realtime tasks had been piling up for hours
2018-04-30
- 23:04 ottomata: blacklisting change-prop and job queue topics from main-eqiad -> analytics (eqiad)
- 22:55 ottomata: bouncing kafka main-eqiad -> eqiad (analytics) mirror maker
- 19:34 joal: Retry releasing refinery-source to archiva
- 18:43 joal: Releasing refinery-source
- 15:53 joal: Resume webrequest-druid-hourly-coord and pageview-druid-hourly-coord
- 14:23 joal: Suspend webrequest-druid-hourly-coord and pageview-druid-hourly-coord before druid upgrade
- 14:23 elukey: disabled cron/check on analytics1003 to respawn banner impressions if needed
- 14:21 joal: Kill BannerImpressionStream job before upgrading druid
2018-04-25
- 14:39 elukey: re-enable camus after maintenance
- 14:37 elukey: restart hive-server2 on analytics1003 to pick up settings in https://gerrit.wikimedia.org/r/428919
- 13:40 elukey: stop camus on an1003 as prep step to gracefully restart hive server
- 12:24 joal: Only false positive for Data Loss Warning - Workflow webrequest-load-check_sequence_statistics-wf-upload-2018-4-25-10
2018-04-24
- 16:30 elukey: restart hadoop hdfs journalnode on analytics1035/52 to pick up prometheus jmx settings
- 14:41 elukey: restart hadoop hdfs journalnode on analytics1028 to pick up jmx settings
- 12:08 elukey: restart webrequest-load-wf-text-2018-4-24-9 via Hue (failed due to reimages)
- 06:57 joal: correct reindexation job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033859-180330093100664-oozie-oozi-C/
- 06:55 joal: Reindexation job: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0033855-180330093100664-oozie-oozi-C/
- 06:54 joal: Manually reindexing all of mediawiki-history for snapshot 2018-03 after having messed it with job testing
2018-04-23
- 20:41 milimetric: deployed a version of wikistats with all but reading metrics disabled to stop showing bad data
- 19:34 elukey: deploy https://gerrit.wikimedia.org/r/428331 for Pivot
- 14:10 ottomata: switching main -> analytics MirrorMaker to --new.consumer (temporarily stopping puppet on kafka101[234]) https://phabricator.wikimedia.org/T192387
- 13:54 elukey: reimage analytics1067 to debian stretch
2018-04-20
- 18:23 joal: Drop/recreate wmf.mediawiki_user_history and wmf.mediawiki_page_history for T188669
- 14:17 elukey: d-[1,2,3] hosts in the analytics labs project upgraded to druid 0.10
- 11:37 fdans: manually uploaded refinery whitelist to hdfs
- 11:33 elukey: reimage analytics1068 to Debian stretch
2018-04-19
- 20:39 milimetric: launched virtual pageviews job, it has id 0026169-180330093100664-oozie-oozi-C
- 20:36 milimetric: Synced latest refinery version to HDFS
- 17:35 fdans: refinery deployment - sync to hdfs finished
- 16:27 elukey: analytics1069 reimaged to Debian stretch
- 15:40 fdans: deploying refinery
- 14:30 elukey: disabled druid1001's middlemanager, restarted 1002's
- 14:19 elukey: add 60G /srv partition to hadoop-coordinator-1 in analytics labs
- 14:04 elukey: disabled druid1002's worker as prep step for restart - jvms with an old version running realtime indexation
2018-04-16
- 10:04 joal: Restart metrics job after table update
- 09:54 joal: Update wmf.mediawiki_metrics table for T190058
- 08:41 joal: Restart Mediawiki-history job after new patches
- 08:35 joal: Restarting wikidata-articleplaceholder oozie job after last week's failures
- 08:29 joal: Deploying refinery onto HDFS
- 08:22 joal: Deploying refinery from tin
- 08:03 joal: Correction - Deploying refinery-source v0.0.62 using Jenkins !
- 08:03 joal: Deploying refinery source v0.0.62 from tin
2018-04-12
- 20:34 ottomata: replacing references to dataset1001.wikimedia.org:: with /srv/dumps in stat1005:~ezachte/wikistats/dammit.lt/bash: for f in $(sudo grep -l dataset1001.wikimedia.org *); do sudo sed -i 's@dataset1001.wikimedia.org::@/srv/dumps/@g' $f; done T189283
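The one-liner in the 20:34 entry above can be sketched as a small reusable function. This is only an illustration under stated assumptions: the function name `rewrite_dataset_refs`, the demo file names, and the `pagecounts-raw` module are hypothetical, `sed -i` is assumed to be GNU sed, and `sudo` is left out so the sketch can run against a scratch directory. Compared with the logged command, it quotes file names, matches the host as a fixed string, and escapes the dots in the sed pattern.

```shell
#!/bin/sh
# Hypothetical helper (not from the log): rewrite rsync-style references to
# dataset1001.wikimedia.org:: into /srv/dumps/ in every regular file directly
# inside the given directory.
rewrite_dataset_refs() {
    dir=$1
    # grep -l prints only the names of files that contain the old reference;
    # -F treats the pattern as a fixed string rather than a regex.
    grep -lF 'dataset1001.wikimedia.org::' "$dir"/* 2>/dev/null |
    while IFS= read -r f; do
        # Same substitution as in the log entry, with the dots escaped.
        # Assumes GNU sed for in-place editing with a bare -i.
        sed -i 's@dataset1001\.wikimedia\.org::@/srv/dumps/@g' "$f"
    done
}

# Demo on a scratch directory with made-up file contents.
demo=$(mktemp -d)
printf 'rsync dataset1001.wikimedia.org::pagecounts-raw/ /tmp/x\n' > "$demo/sync.sh"
printf 'echo untouched\n' > "$demo/other.sh"
rewrite_dataset_refs "$demo"
cat "$demo/sync.sh"
```

Files that do not mention the old host are never passed to sed, so they are left unmodified.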
2018-04-11
- 16:48 elukey: restart hadoop namenodes to pick up HDFS trash settings
2018-04-10
- 22:43 joal: Deploying refinery with scap
- 22:42 joal: Refinery-source 0.0.61 deployed on archiva
- 20:43 ottomata: bouncing main -> jumbo mirrormakers to blacklist job topics until we have time to investigate more
- 20:38 ottomata: restarted event* camus and refine cron jobs, puppet is reenabled on analytics1003
- 20:14 ottomata: restart mirrormakers main -> jumbo (AGAIN)
- 19:26 ottomata: restarted camus-webrequest and camus-mediawiki (avro) camus jobs
- 18:18 ottomata: restarting all hadoop nodemanagers, 3 at a time to pick up spark2-yarn-shuffle.jar T159962
- 18:06 joal: Deploy refinery to HDFS
- 17:46 joal: Refinery source 0.0.60 deployed to archiva
- 15:42 ottomata: disable puppet on analytics1003 and stop camus crons in preparation for spark 2 upgrade
- 14:25 ottomata: bouncing all main -> jumbo mirror makers, they look stuck!
- 09:00 elukey: restart eventlogging mysql consumers on eventlog1002 to pick up new DNS changes for m4-master - T188991
2018-04-09
- 07:15 elukey: upgrade kafka burrow on kafkamon*
2018-04-06
- 17:14 joal: Launch manual mediawiki-history-reduced job to test memory setting (and index new data) -- mediawiki-history-reduced-wf-2018-03
- 13:39 joal: Rerun mediawiki-history-druid-wf-2018-03
2018-04-05
- 19:24 ottomata: upgrading spark2 to spark 2.3
- 13:43 mforns: created success files in /wmf/data/raw/mediawiki/tables/<table>/snapshot=2018-03 for <table> in revision, logging, pagelinks
- 13:38 mforns: copied sqooped data for mediawiki history from /user/mforns over to /wmf/data/raw/mediawiki/tables/ for enwiki, table: revision
2018-04-04
- 21:07 mforns: copied sqooped data for mediawiki history from /user/mforns over to /wmf/data/raw/mediawiki/tables/ for wikidatawiki and commonswiki, tables: revision, logging and pagelinks
- 16:06 elukey: killed banner-impression related jvms on an1003 to finish openjdk-8 upgrades (they should be brought back via cron)
2018-04-03
- 20:11 ottomata: bouncing main -> jumbo mirrormaker to apply batch.size = 65536
- 19:32 ottomata: bouncing main -> jumbo MirrorMaker unsetting session.timeout.ms, this has a restriction on the broker in 0.9 :(
- 19:22 ottomata: bouncing main -> jumbo MirrorMaker setting session.timeout.ms = 125000
- 18:46 ottomata: restart main -> jumbo MirrorMaker with request.timeout.ms = 2 minutes
- 15:26 elukey: manually run hdfs balancer on an1003 (tmux session)
- 15:25 elukey: killed a jvm belonging to hdfs-balancer stuck from march 9th
- 13:48 ottomata: re-enable job queue topic mirroring from main -> eqiad
2018-04-02
- 22:28 ottomata: bounce mirror maker to pick up client_id config changes
- 20:55 ottomata: deployed multi-instance mirrormaker for main -> jumbo. 4 per host == 12 total processes
- 11:25 joal: Repair cu_changes hive table after successful sqoop import and add _PARTITIONED file for oozie jobs to launch
- 08:33 joal: rerun wikidata-specialentitydata_metrics-wf-2018-4-1
2018-03-30
- 13:48 elukey: restart overlord+middlemanager on druid100[23] to avoid consistency issues
- 13:41 elukey: restart overlord+middlemanager on druid1001 after failures in real time indexing (overlord leader)
- 09:44 elukey: re-enable camus
- 08:26 elukey: stopped camus to drain the cluster - prep for easy restart of analytics1003's jvm daemons
2018-03-29
- 20:55 milimetric: accidentally killed mediawiki-geowiki-monthly-coord, and then restarted it
- 20:12 ottomata: blacklisted mediawiki.job topics from main -> jumbo MirrorMaker again, don't want to page over the weekend while this still is not stable. T189464
- 07:30 joal: Manually repairing hive mediawiki_private_cu_changes table after manual sqooping of 2018-01 data, and add _PARTITIONNED file to the folder
2018-03-28
- 19:39 ottomata: bouncing main -> jumbo mirrormaker to apply increase in consumer num.streams
- 19:21 milimetric: synced refinery to hdfs (only python changes but just so we have latest)
- 19:20 joal: Start Geowiki jobs (monthly and druid) starting 2018-01
- 18:36 joal: Making hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private accessible only by analytics-privatedata-users group (and hdfs obviously)
- 18:02 joal: Kill-Restart mobile_apps-session_metrics (bundle killed, coord started)
- 18:00 joal: Kill-Restart mediawiki-history-reduced-coord after deploy
- 17:44 joal: Deploying refinery onto hadoop
- 17:29 joal: Deploy refinery using scap
- 16:32 ottomata: bouncing main -> jumbo mirror makers to increase heap size to 2G
- 14:16 ottomata: re-enabling replication of mediawiki job topics from main -> jumbo
2018-03-27
- 14:03 elukey: consolidate all the zookeeper definitions into a single 'main-eqiad' one in Horizon -> Project-Analytics
- 11:16 elukey: kill banner impression job to force a respawn (still using an old jvm)
2018-03-26
- 15:12 elukey: restart eventlogging mysql consumers after maintenance
- 14:26 ottomata: restarting jumbo -> eqiad mirror makers with prometheus instead of jmx
- 13:28 ottomata: restarting kafka mirror maker main -> jumbo using new consumer
- 13:09 fdans: stopped 2 mysql consumers as precaution for T174386
2018-03-24
- 08:13 joal: kill failing query swamping the cluster (application_1520532368078_47226)
2018-03-23
- 16:44 elukey: invalidated 2018-03-12/13 for iOS data in piwik to force a re-run of the archiver
2018-03-20
- 10:10 elukey: removed old mysql/ssh/ganglia analytics vlan firewall rules (https://phabricator.wikimedia.org/T189408#4055749)
2018-03-19
- 09:38 elukey: restart hadoop daemons on analytics1070 for openjdk upgrades (canary)
2018-03-16
- 20:23 ottomata: bouncing main -> jumbo mirror makers to apply change-prop topic blacklist
- 14:44 ottomata: restarting eventlogging mysql eventbus consumer to consume from analytics instead of jumbo
- 14:38 elukey: temporarily point pivot to druid1002 as prep step for druid1001's reboot
- 14:37 elukey: disable druid1001's middlemanager as prep step for reboot
- 14:24 elukey: changed superset druid private config from druid1002 to druid1003
- 13:43 elukey: disable druid1002's middle manager via API as prep step for reboot
- 09:57 elukey: restart eventlogging-consumer@mysql-m4/eventbus on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004)
2018-03-15
- 22:13 ottomata: bounced jumbo mirror makers
- 19:10 ottomata: bouncing main -> jumbo mirror maker
- 14:50 joal: Restart clickstream-coord to pick new config including fawiki
- 14:29 elukey: disabled druid1003's middlemanager as prep step for reboot
- 14:07 ottomata: bouncing kafka jumbo -> eqiad mirrormaker
2018-03-14
- 15:27 ottomata: bouncing main -> jumbo mirror maker instances
- 14:45 ottomata: beginning migration of eventlogging analytics from Kafka analytics to Kafka jumbo: T183297
2018-03-13
- 20:47 ottomata: restarting eventlogging processors to pick up VirtualPageView blacklist from eventlogging-valid-mixed topic
- 15:13 ottomata: bounce main -> analytics mirror maker instances
- 15:07 ottomata: bouncing MirrorMaker on kafka1020 (main -> jumbo) to re-apply acks=all
- 14:55 ottomata: bouncing MirrorMaker on kafka1022 to re-apply acks=all (main -> jumbo)
- 14:32 ottomata: bouncing MirrorMaker on kafka1023 (main -> jumbo) to re-apply acks=all
- 14:22 ottomata: bouncing mirrormaker for main -> analytics on kafka101[234] to apply roundrobin
2018-03-12
- 19:39 ottomata: deployed new Refine jobs (eventlogging, eventbus, etc.) with deduplication and geocoding and casting
- 18:17 ottomata: bouncing kafka mm eqiad -> jumbo with acks=1
- 18:10 ottomata: bouncing kafka mirrormaker for main-eqiad -> jumbo with buffer.memory=128M
- 17:34 joal: Restart mediawiki-history-reduced oozie job to add a dependency
- 16:55 joal: Restart mobile_apps_session_metrics
- 16:52 joal: Deploying refinery on HDFS for mobile_apps patch
- 16:26 joal: Deploying refinery again to provide patch for mobile_apps_session_metric job
- 15:09 joal: Deploy refinery onto hdfs
- 15:07 joal: Deploy refinery from scap
- 14:32 elukey: restart druid-broker on druid1004 - no /var/log/druid/broker.log after 2018-03-10T22:38:52 (java.io.IOException: Too many open files)
- 08:50 elukey: fixed eventlog1002's ipv6 (https://gerrit.wikimedia.org/r/#/c/418714/)
2018-03-10
- 09:07 joal: Rerun clickstream-wf-2018-2
- 00:32 milimetric: finished sqooping pagelinks for missing dbs, hdfs -put a SUCCESS flag in the 2018-02 snapshot, jobs should run unless Hue is still lying to itself
2018-03-09
- 17:29 joal: Rerun mediawiki-history-reduced job after having manually repaired wmf_raw.mediawiki_project_namespace_map
2018-03-08
- 18:05 ottomata: bouncing ResourceManagers
- 08:54 elukey: re-enable camus after reboots
- 07:15 elukey: disable Camus on an1003 to allow the cluster to drain - prep step for an100[123] reboot
2018-03-07
- 07:15 elukey: manually re-run wikidata-articleplaceholder_metrics-wf-2018-3-6
2018-03-06
- 20:44 ottomata: reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136
- 20:35 ottomata: pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136
- 19:06 elukey: cleaned up id=0 rows on db1108 (log database) for T188991
- 10:19 elukey: restart webrequest-load-wf-upload-2018-3-6-7 (failed due to reboots)
- 10:08 elukey: re-starting mysql consumers on eventlog1001
- 09:41 elukey: stop eventlogging's mysql consumers for db1107 (el master) kernel updates
2018-03-05
- 18:22 elukey: restart webrequest-load-wf-upload-2018-3-5-16 via Hue (failed due to reboots)
- 18:21 elukey: restart webrequest-load-wf-text-2018-3-5-16 via Hue (failed due to reboots)
- 15:00 mforns: rerun mediacounts-load-wf-2018-3-5-9
- 10:57 joal: Relaunch Mediawiki-history job manually from spark2 to see if the new version helps
- 10:57 joal: Killing failing Mediawiki-History job for 2018-03
2018-03-02
- 15:33 mforns: rerun webrequest-load-wf-text-2018-3-2-12
2018-03-01
- 14:59 elukey: shutdown deployment-eventlog02 in favor of deployment-eventlog05 in deployment-prep (Ubuntu -> Debian EL migration)
- 09:45 elukey: rerun webrequest-load-wf-text-2018-3-1-6 manually, failed due to analytics1030's reboot
2018-02-28
- 22:09 milimetric: re-deployed refinery for a small docs fix in the sqoop script
- 17:55 milimetric: Refinery synced to HDFS, deploy completed
- 17:40 milimetric: deploying Refinery
- 08:38 joal: rerun cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2018-2-27-15
2018-02-27
- 19:12 ottomata: updating spark2-* CLIs to spark 2.2.1: T185581
2018-02-21
- 20:48 ottomata: now running 2 camus webrequest jobs, one consuming from jumbo (no data yet), the other from analytics. these should be fine to run in parallel.
- 07:21 elukey: reboot db1108 (analytics-slave.eqiad.wmnet) for mariadb+kernel updates
2018-02-19
- 17:14 elukey: deployed eventlogging - https://gerrit.wikimedia.org/r/#/c/405687/
- 07:35 elukey: re-run wikidata-specialentitydata_metrics-wf-2018-2-17 via Hue
2018-02-16
- 15:41 elukey: add analytics1057 back in the Hadoop worker pool after disk swap
- 10:55 elukey: increased topic partitions for netflow to 3
2018-02-15
- 13:54 milimetric: deployment of refinery and refinery-source done
- 12:52 joal: Killing webrequest-load bundle (next restart should be at hour 12:00)
- 08:18 elukey: removed jmxtrans and java 7 from analytics1003 and re-launched refinery-drop-mediawiki-snapshots
- 07:51 elukey: removed default-java packages from analytics1003 and re-launched refinery-drop-mediawiki-snapshots
2018-02-14
- 13:44 elukey: rollback java 8 upgrade for archiva - issues with Analytics builds
- 13:35 elukey: installed openjdk-8 on meitnerium, manually upgraded java-update-alternatives to java8, restarted archiva
- 13:14 elukey: removed java 7 packages from analytics100[12]
- 12:43 elukey: jmxtrans removed from all the Hadoop workers
- 12:43 elukey: openjdk-7-* packages removed from all the Hadoop workers
2018-02-13
- 11:42 elukey: force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around)
2018-02-12
- 23:16 elukey: re-run webrequest-load-wf-upload-2018-2-12-21 via Hue (node managers failure)
- 23:13 elukey: manual restart of Yarn Node Managers on analytics1058/31
- 23:09 elukey: cleaned up tmp files on all analytics hadoop worker nodes, job filling up tmp
- 17:18 elukey: home dirs on stat1004 moved to /srv/home (/home symlinks to it)
- 17:15 ottomata: restarting eventlogging-processors to blacklist Print schema in eventlogging-valid-mixed (MySQL)
- 14:46 ottomata: deploying eventlogging for T186833 with EventCapsule in code and IP NO_DB_PROPERTIES
2018-02-09
- 12:19 joal: Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8
2018-02-08
- 16:23 elukey: stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one
2018-02-07
- 13:55 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
- 13:49 elukey: restart overlord/middlemanager on druid1005
2018-02-06
- 19:40 joal: Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01
- 15:36 elukey: drain + shutdown of analytics1038 to replace faulty BBU
- 09:58 elukey: applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing
2018-02-05
- 15:51 elukey: live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291
- 11:03 elukey: restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero)
2018-02-02
- 17:09 joal: Webrequest upload 2018-02-02 hours 9 and 11 dataloss warnings have been checked - They are false positives
- 09:56 joal: Rerun unique_devices-per_project_family-monthly-wf-2018-1 after failure
2018-02-01
- 17:00 ottomata: killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck.
- 14:06 joal: Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives
- 12:17 joal: Restart cassandra monthly bundle after January deploy
2018-01-23
- 20:10 ottomata: hdfs dfs -chmod 775 /wmf/data/archive/mediacounts/daily/2018 for T185419
- 09:26 joal: Dataloss warning for upload and text 2018-01-23:06 is confirmed to be false positive
2018-01-22
- 17:36 joal: Kill-Restart clickstream oozie job after deploy
- 17:12 joal: deploying refinery onto HDFS
- 17:12 joal: Refinery deployed from scap
2018-01-18
- 19:11 joal: Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05
- 19:10 joal: Add fake data to cassandra to silence alarms (Thanks again ema)
- 18:56 joal: Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload
- 15:21 mforns: refinery deployment using scap and then deploying onto hdfs finished
- 15:07 mforns: starting refinery deployment
- 12:43 elukey: piwik on bohrium re-enabled
- 12:40 elukey: set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot)
- 09:38 elukey: reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites
- 09:37 elukey: resumed druid hourly index jobs via hue and restored pivot's configuration
- 09:21 elukey: reboot druid1001 for kernel upgrades
- 09:00 elukey: suspended hourly druid batch index jobs via Hue
- 08:58 elukey: temporarily set druid1002 in superset's druid cluster config (via UI)
- 08:53 elukey: temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted)
- 08:52 elukey: disable druid1001's middlemanager as prep step for reboot
- 07:11 elukey: re-run webrequest-load-wf-misc-2018-1-18-3 via Hue
2018-01-17
- 17:33 elukey: killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present)
- 17:29 elukey: restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task)
- 16:24 elukey: re-run all the pageview-druid-hourly failed jobs via Hue
- 14:42 elukey: restart druid middlemanager on druid1003 as attempt to unblock realtime streaming
- 14:21 elukey: forced kill of banner impression data streaming job to get it restarted
- 11:44 elukey: re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot)
- 11:44 elukey: restart druid middlemanager on druid1002
- 10:38 elukey: stopped all crons on hadoop-coordinator-1
- 10:37 elukey: re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot)
- 10:22 elukey: reboot druid1002 for kernel upgrades
- 09:53 elukey: disable druid middlemanager on druid1002 as prep step for reboot
- 09:46 elukey: rebooted analytics1003
- 09:46 elukey: removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
- 08:53 elukey: disabled camus as prep step for analytics1003 reboot
2018-01-15
- 13:39 elukey: stop eventlogging and reboot eventlog1001 for kernel updates
- 09:58 elukey: rolling reboots of aqs hosts (1005->1009) for kernel updates
- 09:11 elukey: reboot aqs1004 for kernel updates
2018-01-12
- 13:03 joal: Rerun webrequest-load-wf-text-2018-1-12-9
- 13:02 joal: Rerun webrequest-load-wf-upload-2018-1-12-9
- 10:33 elukey: reboot analytics1066->69 for kernel updates
- 09:07 elukey: reboot analytics1063->65 for kernel updates
2018-01-11
- 22:35 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/403774
- 22:04 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403762/
- 20:57 ottomata: restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403753/
- 17:37 joal: Kill manual banner-streaming job to see it restarted by cron
- 17:11 ottomata: restart kafka on kafka-jumbo1003
- 17:08 ottomata: restart kafka on kafka-jumbo1001...something is not right with my certpath change yesterday
- 14:46 joal: Deploy refinery onto HDFS
- 14:33 joal: Deploy refinery with Scap
- 14:07 joal: Manually restarting banner streaming job to prevent alerting
- 13:23 joal: Killing banner-streaming job to have it auto-restarted from cron
- 11:45 elukey: re-run webrequest-load-wf-text-2018-1-11-8 (failed due to reboots)
- 11:39 joal: rerun mediacounts-load-wf-2018-1-11-8
- 10:48 joal: Restarting banner-streaming job after hadoop nodes reboot
- 10:01 elukey: reboot analytics1059-61 for kernel updates
- 09:34 elukey: reboot analytics1055->1058 for kernel updates
- 09:04 elukey: reboot analytics1051->1054 for kernel updates
2018-01-10
- 16:56 elukey: reboot analytics1048->50 for kernel updates
- 16:23 ottomata: restarting kafka jumbo brokers to apply java.security certpath restrictions
- 11:51 elukey: re-run webrequest-load-wf-upload-2018-1-10-10 (failed due to reboots)
- 11:27 elukey: re-run webrequest-load-wf-text-2018-1-10-10 (failed due to reboots)
- 11:26 elukey: reboot analytics1044->47 for kernel updates
- 11:03 elukey: reboot analytics1040->43 for kernel updates
2018-01-09
- 16:53 joal: Rerun pageview-druid-hourly-wf-2018-1-9-13
- 15:33 elukey: stop mysql on dbstore1002 as prep step for shutdown (stop all slaves, mysql stop)
- 15:10 elukey: reboot analytics1028 (hadoop worker and hdfs journal node) for kernel updates
- 15:00 elukey: reboot kafka-jumbo1006 for kernel updates
- 14:41 elukey: reboot kafka-jumbo1005 for kernel updates
- 14:33 elukey: reboot kafka1023 for kernel updates
- 14:04 elukey: reboot kafka1022 for kernel updates
- 13:51 elukey: reboot kafka-jumbo1003 for kernel updates
- 10:08 elukey: reboot kafka-jumbo1002 for kernel updates
- 09:35 elukey: reboot kafka1014 for kernel updates
2018-01-08
- 19:07 milimetric: Deployed refinery and synced to hdfs
- 15:23 elukey: reboot kafka1013 for kernel updates
- 13:40 elukey: reboot analytics10[36-39] for kernel updates
- 12:59 elukey: reboot kafka1012 for kernel updates
- 12:43 joal: Deploy AQS from tin
- 12:36 fdans: Deploying AQS
- 12:33 joal: Update fake-data in cassandra adding top-by-country needed row
- 11:07 elukey: re-run webrequest-load-wf-text-2018-1-8-8 (failed after some reboots due to kernel updates)
- 10:07 elukey: drain + reboot analytics1029,1031->1034 for kernel updates
2018-01-07
- 09:01 elukey: re-enabled puppet on db110[78] - eventlogging_sync restarted on db1108 (analytics-slave)
2018-01-06
- 08:09 elukey: re-enable eventlogging mysql consumers after database maintenance
2018-01-05
- 13:18 fdans: deploying AQS
2018-01-04
- 19:54 joal: Deploying refinery onto hadoop
- 19:45 joal: Deploy refinery using scap
- 19:38 joal: Deploy refinery-source using jenkins
- 16:01 ottomata: killing json_refine_eventlogging_analytics job that started yesterday and has not completed (has no executors running?) application_1512469367986_81514. I think the cluster is just too busy? mw-history job running...
- 10:34 elukey: re-run mediacounts-archive-wf-2018-01-03
2018-01-03
- 15:00 ottomata: restarting kafka-jumbo brokers to enable tls version and cipher suite restrictions
2018-01-02
- 11:13 joal: Kill and restart cassandra loading oozie bundle to pick new pageview_top_bycountry job
- 08:22 elukey: restart druid coordinators to pick up new jvm settings (freeing up 6GB of used memory)