Analytics/Server Admin Log/Archive/2017
2017-12-21
- 15:54 joal: Start backfilling monthly pageview-by-country
- 15:45 joal: deploy refinery onto HDFS
- 15:38 joal: Deploying refinery with Scap
2017-12-20
- 15:40 ottomata: removing some old webrequest data from hdfs
- 14:56 ottomata: dropping some old wmf.webrequest partitions and data
2017-12-19
- 17:28 elukey: re-enabled superset
- 17:16 joal: Initializing new cassandra keyspace for pageviews/top-by-country
- 16:52 elukey: temporarily stop superset to test druid's performances
- 16:34 elukey: manually started eventlogging cleaner on db1107 to purge/sanitize data up to 90 days ago (tmux is running for user eventlogcleaner)
- 14:10 elukey: temporarily changed JVM Heap settings for the druid broker on druid1001 - Xmx25g Xms10g (run puppet and restart the daemon to rollback)
2017-12-12
- 15:54 milimetric: sieging aqs1004 with 100.000 transactions
2017-12-11
- 14:07 elukey: disable druid middlemanager on druid1003 to drain + restart to pick up new logging settings
- 12:59 elukey: disabled druid middlemanager on druid1002 to drain+restart with new logging config
2017-12-08
- 11:15 joal: Start mediawiki-history oozie jobs new-version
- 10:49 joal: Update wmf.mediawiki_history as explained in email (rename current table to old, create new one)
2017-12-07
- 21:09 joal: Start clickstream oozie job
- 20:45 joal: Kill restbase oozie job and restart apis replacing one
- 20:12 joal: Trying to deploy refinery again
- 19:30 joal: Deploying refinery now that -source is deployed
- 18:39 milimetric: Deployed refinery-source using jenkins
- 15:03 elukey: restart webrequest-misc load job (Dec 7 2017 06:00:00)
- 12:24 elukey: camus re-enabled after analytics1003 reboot
- 08:31 elukey: stop camus on an1003 as prep step for reboot
2017-12-06
- 14:55 elukey: restart hue to pick up the new oozie server
- 14:55 elukey: oozie server accidentally restarted due to a puppet change (the service auto-restarts)
- 11:22 elukey: disabled temporarily druid's middlemanager on druid1001 to test the Real Time monitor setting
2017-12-05
- 10:35 elukey: re-enable webrequest bundle and camus after reboots
- 10:31 elukey: disabled druid middlemanager on druid1003 with curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
- 10:03 elukey: stop camus as precautionary measure before Hadoop masters reboot
- 09:57 elukey: suspend webrequest load bundle as extra precaution before Hadoop masters reboot
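The middlemanager drain logged at 10:31 can be scripted. The following is a minimal sketch, not the team's tooling: only the `/druid/worker/v1/disable` path and port 8091 come from the entry above, while the function names and the `enable` counterpart are assumptions made for illustration.

```python
from urllib.request import Request, urlopen

# Only the /druid/worker/v1/disable path and port 8091 appear in the log entry;
# the function names and the "enable" counterpart are assumptions for illustration.
_ACTIONS = ("disable", "enable")


def worker_request(host: str, action: str, port: int = 8091) -> Request:
    """Build, without sending, the POST request for a middlemanager worker action."""
    if action not in _ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    url = f"http://{host}:{port}/druid/worker/v1/{action}"
    # An empty body with an explicit method approximates `curl -X POST <url>`.
    return Request(url, data=b"", method="POST")


def send(request: Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it lets the URL be checked before anything touches a production host, for example `send(worker_request("druid1003.eqiad.wmnet", "disable"))`.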
2017-12-04
- 16:29 elukey: restart webrequest-load-wf-upload-2017-12-4-12 (failed due to hadoop reboots)
- 16:12 elukey: restart webrequest-load-wf-upload-2017-12-4-13 (failed due to hadoop reboots)
- 15:09 joal: Rerun webrequest-load-wf-upload-2017-12-4-12 and webrequest-load-wf-upload-2017-12-4-13
- 15:08 joal: Rerunning 15:47:35 < fdans> whatuuuup mforns
- 14:17 elukey: re-run pageview-druid-hourly-wf-2017-12-4-11 in Hue (failed due to reboots)
- 12:04 elukey: re-run webrequest-load-wf-upload-2017-12-4-8 (failed due to reboots)
- 12:04 elukey: re-run webrequest-load-check_sequence_statistics-wf-upload-2017-12-4-7 (failed due to reboots)
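The workflow names in these entries appear to end in the time period they cover (year, month, then optionally day and hour), sometimes preceded by a webrequest source such as `upload` or `text`. A minimal sketch for splitting such a name when tallying reruns follows; the naming pattern is inferred from this log rather than from the Oozie configuration, and the helper name is hypothetical.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Pattern inferred from names in this log, e.g.
#   webrequest-load-wf-upload-2017-12-4-12   (job, source, year-month-day-hour)
#   pageview-druid-hourly-wf-2017-12-4-11    (job, year-month-day-hour)
#   mediacounts-archive-wf-2017-04-24        (job, year-month-day)
# It is an assumption, not taken from the Oozie job definitions.
_WF_NAME = re.compile(
    r"^(?P<job>.+)-wf-"
    r"(?:(?P<source>[a-z]+)-)?"
    r"(?P<year>\d{4})-(?P<month>\d{1,2})"
    r"(?:-(?P<day>\d{1,2}))?"
    r"(?:-(?P<hour>\d{1,2}))?$"
)


def parse_workflow_name(name: str) -> Tuple[str, Optional[str], datetime]:
    """Split a workflow name into (job, source, start of the period it covers)."""
    match = _WF_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised workflow name: {name!r}")
    period = datetime(
        int(match["year"]),
        int(match["month"]),
        int(match["day"] or 1),   # monthly jobs carry no day
        int(match["hour"] or 0),  # daily and monthly jobs carry no hour
    )
    return match["job"], match["source"], period
```

For instance, `parse_workflow_name("webrequest-load-wf-upload-2017-12-4-12")` yields the job `webrequest-load`, the source `upload`, and the hour starting 2017-12-04 12:00.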
2017-12-02
- 11:47 joal: Rerun unique_devices-per_project_family-monthly-wf-2017-11
2017-12-01
- 15:20 elukey: rerun webrequest-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency
- 15:09 elukey: rerun pageview-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency
- 13:07 elukey: re-run aqs-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots)
- 12:42 elukey: temporarily switch pivot's config to druid1002 (to reboot druid1001)
- 12:37 elukey: re-run webrequest-load-wf-upload-2017-12-1-10 and webrequest-load-wf-upload-2017-12-1-7 (failed due to Hadoop reboots)
- 12:36 elukey: re-run webrequest-load-wf-text-2017-12-1-10 and webrequest-load-wf-text-2017-12-1-9 (failed due to Hadoop reboots)
- 12:35 elukey: re-run pageview-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots)
- 12:34 elukey: re-run webrequest-druid-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots)
2017-11-30
- 18:20 elukey: re-run webrequest-load-wf-upload-2017-11-30-16 (failed due to hadoop reboots)
- 18:19 elukey: re-run webrequest-load-wf-text-2017-11-30-14 (failed due to hadoop reboots)
- 16:21 joal: wikidata-wdqs_extract-wf-2017-11-30-15
- 15:50 elukey: restart hue on thorium - timeouts and 500s
- 14:58 joal: Update druid overlord config to equalDistribution dynamically
2017-11-29
- 21:46 joal: rerun pageview-druid-hourly-wf-2017-11-29-18 and pageview-druid-hourly-wf-2017-11-29-19
- 21:19 joal: rerun webrequest-druid-hourly-wf-2017-11-29-18
2017-11-28
- 14:41 ottomata: restarting eventlogging on eventlog1001 for https://gerrit.wikimedia.org/r/#/c/393613/
- 09:08 elukey: log database on dbstore1002 dropped for good
2017-11-22
- 16:09 ottomata: restarting eventlogging services on eventlog1001
2017-11-20
- 18:28 elukey: deployed prometheus-druid-exporter (still not released in apt) on druid1004 for testing
- 15:45 ottomata: deploying fixes to EL EventCapsule discrepancies: https://phabricator.wikimedia.org/T179625#3755242
2017-11-16
- 15:25 milimetric: deployed refinery and running interlanguage links dataset now
2017-11-15
- 14:22 addshore: addshore@stat1005:/srv/analytics-wmde$ sudo -u analytics-wmde rm -rf /srv/analytics-wmde/r-library
- 14:22 addshore: addshore@stat1005:/srv/analytics-wmde$ sudo -u analytics-wmde rm -rf /srv/analytics-wmde/installRlib
2017-11-14
- 09:45 elukey: executed chmod g+rx /home/ezachte/wikistats_data/dumps to unblock Joseph (should be safe)
2017-11-13
- 21:20 addshore: addshore@stat1005:/srv/analytics-wmde/wdcm/src$ sudo -u analytics-wmde Rscript ./_installProduction_analytics-wmde.R
- 21:20 addshore: test
- 14:44 joal: Resuming all druid loading jobs after fixing restart issues
- 14:18 joal: Suspending pageview-druid-hourly-coord again trying to fix druid loading
- 14:10 joal: Unsuspend pageview-druid-hourly-coord
- 13:08 joal: Suspend webrequest druid loading waiting for elukey
- 13:05 joal: Rerun webrequest-druid-hourly-wf-2017-11-13-11
- 11:15 elukey: suspend pageview-druid-hourly-coord to allow an easier druid daemon reload (new prometheus jvm agent)
2017-11-08
- 15:16 ottomata: deploying eventlogging analytics change for eventcapsule schema fixes, will be no-op until we deploy puppet changes too
- 11:28 elukey: resumed cassandra-coord-pageview-per-project-hourly after maintenance to aqs hosts
- 10:04 elukey: suspended cassandra-coord-pageview-per-project-hourly as prep step to reboot aqs nodes - T179943
2017-11-06
- 15:37 milimetric: found geowiki was hitting the wrong databases, updated it to always hit analytics-store
2017-11-03
- 10:55 joal: Kill mediawiki-history oozie job to prevent computing october snapshot before fixing reconstruction process
2017-11-02
- 08:54 elukey: relaunched failed pageview-druid-hourly jobs - Druid indexation check failures in the logs (01 Nov 2017 21:00:00 and 01 Nov 2017 19:00:00)
2017-11-01
- 20:06 ottomata: rerunning pageview-druid-hourly-wf-2017-11-1-18
- 19:05 ottomata: deploying refinery with refinery/source 0.0.54 for JsonRefine job T162610
- 18:40 ottomata: rerunning unique_devices-per_project_family-druid-monthly-wf-2017-10
2017-10-30
- 10:12 elukey: added Francisco to the analytics-alerts@ mailing list
2017-10-27
- 07:40 elukey: re-run wikidata-articleplaceholder_metrics-wf-2017-10-26
- 07:36 elukey: stop & mask hadoop-httpfs.service on analytics1001 after https://gerrit.wikimedia.org/r/#/c/386684/
2017-10-26
- 16:58 ottomata: now mirroring main Kafka cluster topics to jumbo Kafka cluster, with MirrorMaker instances running on analytics-eqiad broker nodes. https://phabricator.wikimedia.org/T177216
2017-10-25
- 13:32 elukey: restart yarn nodemanager and hdfs datanode on analytics1030 to apply new JVM settings
2017-10-24
- 20:29 nuria_: started unique_devices-per_project_family-druid-daily-coord 0102816-170829140538136-oozie-oozi-C
- 20:24 nuria_: restarted job unique_devices-per_project_family-druid-monthly-coord 0102799-170829140538136-oozie-oozi-C
- 20:23 nuria_: restarted job uniques-monthly-per-domain-druid 0102785-170829140538136-oozie-oozi-C
- 19:44 nuria_: killing druid coordinators uniques-monthly and per-project-family: 0066771-170829140538136-oozie-oozi-C,0066767-170829140538136-oozie-oozi-C,0010139-170621131133576-oozie-oozi-C
2017-10-23
- 18:50 joal: Deploying AQS after fix
- 13:30 joal: deploy AQS from tin
2017-10-19
- 20:04 mforns: Deployed refinery using scap, then deployed onto hdfs
- 11:44 joal: deploying AQS in beta
- 11:44 joal: deploying AQS in b
2017-10-16
- 17:32 mforns: restarted EventLogging for changes in blacklist to take effect
- 16:27 joal: Re-Deploy AQS after monitoring fix
- 16:14 joal: Deploy AQS with new code
2017-10-13
- 16:49 ottomata: deployed refinery to use rand() for webrequest sampling
2017-10-12
- 15:40 elukey: run kafka preferred-replica-election to allow kafka1013 to re-join the topic leaders
- 14:48 elukey: disable httpfs access on analytics1001
2017-10-09
- 18:28 ottomata: resuming oozie druid indexing jobs, 1004-1006 are offline
- 16:34 ottomata: stopping druid services on druid1006
- 16:05 ottomata: pausing all druid oozie coordinators in preparation for druid public separation
- 12:47 joal: Kill-Restart oozie job loading mediawiki-history into druid
- 12:14 joal: Kill-Restart oozie jobs loading banner data into druid
- 12:04 joal: Deploy refinery onto HDFS
- 11:47 joal: Deploying refinery from scap
- 08:53 joal: Rerunning wikidata-articleplaceholder_metrics-wf-2017-10-7 after failure
2017-10-06
- 11:10 elukey: restart all druid daemons to pick up new logging changes
- 11:08 joal: Rerun pageview-druid-hourly-wf-2017-10-6-9
- 09:31 elukey: restart all the druid daemons on druid1005 to apply the new logging rules
- 08:49 elukey: restarted all the druid broker daemons to pick up the new logging changes
2017-10-05
- 13:48 milimetric: restarted banner_activity-druid-monthly for September again
2017-10-04
- 18:39 ottomata: druid-analytics.svc.eqiad.wmnet:8082 should only be accessible to analytics networks
- 17:32 ottomata: deploying new LVS service for druid-analytics-broker
2017-10-03
- 14:50 milimetric: restarted failed workflow 0057215-170829140538136-oozie-oozi-W (druid monthly banner activity)
2017-09-28
- 10:02 elukey: re-enabled camus after maintenance
- 09:51 elukey: restart mapreduce history server on an1001 to apply new heap settings (Xmx/s to 4g)
2017-09-27
- 15:18 joal: Kill/restart stuck jobs
- 14:45 elukey: rolling restart of all the Yarn nodemanager daemons on analytics1028-1068 (ease heap consumption pressure, seamless restart)
- 13:40 elukey: manual failover of HDFS namenode from an1002 to an1001
- 13:17 elukey: manual failover of HDFS namenode from an1001 to an1002 to test 6G max heap size
- 13:14 elukey: restart mapreduce history server on analytics1001 after crash (java.lang.OutOfMemoryError: GC overhead limit exceeded)
2017-09-26
- 14:49 joal: restart mobile_apps session_metrics bundle
- 14:49 joal: restart
- 11:01 joal: Restart mediawiki-history-denormalize and mediawiki-history-druid jobs after deploy
- 10:58 joal: Restart webrequest load job after deploy
- 10:35 joal: Deploying refinery onto HDFS
- 10:25 joal: Deploy Refinery with scap
- 09:33 joal: Releasing refinery-source v0.0.53 with Jenkins
2017-09-25
- 08:41 joal: Rerun mobile_apps-session_metrics-wf-2017-9-17 after failure
2017-09-19
- 19:24 joal: Rerun pageview-druid-hourly-wf-2017-9-19-17 failed during druid restart
- 19:19 ottomata: restarting druid broker and historical processes with druid.processing.numMergeBuffers=10
2017-09-14
- 17:35 ottomata: restarting eventlogging processor(s) with MySQL blacklist of PageCreation schema
2017-09-13
- 16:08 ottomata: restarting druid-brokers with increase in query cache size
- 11:15 joal: Kill-Restart mediawiki-history-denormalize-coord and launch new coords mediawiki-history-load and mediawiki-history-reduced
- 11:11 joal: Kill-Restart oozie pageview druid loading jobs (hourly, daily, monthly)
- 11:03 joal: Deploy refinery onto HDFS
- 10:57 joal: Deploy refinery from scap
- 10:08 joal: Deploying refinery-source using Jenkins
2017-09-07
- 08:41 joal: Rerun Workflow banner_activity-druid-daily-wf-2017-9-6
2017-09-04
- 08:55 joal: Kill - Restart mediawiki-history-druid-coord to pick last update
2017-09-01
- 18:30 joal: Rerun Workflow webrequest-load-wf-misc-2017-9-1-16 after very weird failure
- 10:06 elukey: killed root rsyncs on thorium, disabled puppet
- 01:31 ottomata: restarted hue (a few minutes ago) not totally sure why it died
2017-08-30
- 15:54 elukey: re-added analytics1055 among the hdfs/yarn worker after maintenance
- 14:07 elukey: restart java daemons on druid100[456] for jvm security updates
- 09:07 elukey: restart all jvm daemons on druid100[123] for security updates
- 09:07 elukey: restart pageview-druid-hourly-wf-2017-8-30-7 in Hue after druid1001 daemons restart
2017-08-29
- 19:36 ottomata: restarting all kafka brokers and mirror maker processes to apply https://gerrit.wikimedia.org/r/#/c/374610/
- 12:46 elukey: suspend oozie jobs from Hue to allow an easier restart of oozie/hive daemons
2017-08-28
- 13:57 elukey: restart kafka* on kafka1012 for openjdk security updates (canary)
- 10:34 elukey: restart yarn and hdfs on analytics1030 for jvm updates (canary)
2017-08-23
- 19:50 joal: restart oozie webrequest-load bundle after bug correction
- 19:46 joal: Deploy refinery onto hdfs
- 19:41 joal: Deploying refinery from tin
- 19:36 joal: Deployed werbrequest-source using jenkins for bug correction
- 19:26 joal: Alter wmf.webrequest and wmf.wdqs_extract tables to correct bug
- 19:25 joal: Kill oozie webrequest-load bundle for redeploy after bug correction
- 11:04 joal: Update wmf.wdqs_extract table for normalized_host update
- 10:12 joal: Restart oozie webrequest-load bundle after deploy and updates
- 10:09 joal: Alter webrequest table before restarting oozie load bundle
- 10:06 joal: Deploying refinery onto hdfs
- 09:59 joal: Deploying refinery
- 09:59 joal: Kill oozie webrequest-load bundle for restart after deploy
- 08:25 joal: Deploying refinery-source v0.0.50 using jenkins
2017-08-22
- 19:52 joal: Drop / recreate wmf.mediawiki_history table for naming correction
- 13:57 ottomata: sudo -u hdfs hdfs dfs -rm /tmp/druid-indexing/classpath/guava.jar (guava 11.0.2 is conflicting with guava 16.0.1. from druid-hdfs-storage-cdh extension). Not sure how guava 11.0.2 got there, but let's see if it doesn't come back
- 08:27 joal: Rerun druid loading jobs after night failures
2017-08-21
- 13:46 ottomata: adding index on (database, rev_timestamp) on mediawiki_page_create_2 table on db1047: T170990
- 13:26 ottomata: adding index on (database, rev_timestamp) on mediawiki_page_create_2 table on dbstore1002: T170990
2017-08-14
- 16:40 elukey: analytics1034 back in service after swapping the eth cable - T172633
2017-08-10
- 20:06 milimetric: stopped Wikimetrics web and queue on wikimetrics-01.eqiad.wmflabs because the queue ran into errors connecting to the database (max 10 connections limit reached)
- 08:59 elukey: updated librdkafka1 to 0.9.4.1 on eventlog1001
2017-08-08
- 18:39 elukey: restart projectview-hourly-wf-2017-8-8-14, pageview-druid-hourly-wf-2017-8-8-14, pageview-hourly-wf-2017-8-8-14 via Hue (analytics1055 disk failure)
- 14:20 elukey: restart varnishkafka statsv/eventlogging instances to pick up https://gerrit.wikimedia.org/r/#/c/370637/ (kafka protocol explicitly set to 0.9.0.1)
2017-08-06
- 11:03 elukey: stop yarn on analytics1034 to reload the tg3 driver - T172633
2017-08-03
- 16:15 ottomata: druid cluster restarted with 0.9.2 mysql-metadata-storage extension, un-suspending oozie druid jobs
- 14:11 ottomata: pausing oozie druid jobs and doing a cluster upgrade/restart again to make sure updated version of mysql-metadata-storage jar is properly loaded
- 09:56 elukey: set piwik in maintenance mode to allow mysql updates
- 08:08 elukey: restarted Druid jobs failed over night (drud_loader.py error) and due to Hive metastore restart
- 08:03 elukey: restart hive-metastore to pick up new JVM Xms settings
2017-08-02
- 14:34 ottomata: beginning druid upgrade to 0.9.2 (take 2 :) )
- 14:23 elukey: restart hive-server to pick up JVM Xms4g change
- 14:22 ottomata: suspending druid oozie jobs
2017-08-01
- 17:24 ottomata: beginning druid upgrade to 0.9.2 http://druid.io/docs/0.9.2/operations/rolling-updates.html
- 17:10 ottomata: pausing all druid oozie coordinators
- 12:49 elukey: restart hive daemons on analytics1003 to pick up new jvm settings (bigger Xmx, JMX ports)
- 10:05 elukey: suspended again webrequest-load-bundle as prep step to restart the hive daemons
- 07:58 elukey: suspended webrequest-load-bundle as prep step to restart the hive daemons
- 07:03 elukey: restarted mobile_apps-session_metrics-coord-global-30days failed job via Hue
2017-07-31
- 13:45 elukey: suspended webrequest-load-bundle as prep step to restart hive metastore/server
- 10:34 elukey: restart hive-server on an1003 - beeline not connecting, thrift errors
2017-07-28
- 07:55 elukey: update nodejs to 6.11 on aqs1004 (testing prod node after beta qa)
- 07:54 elukey: re-run webrequest-load-wf-upload-2017-7-28-6 from Hue (was playing with eth0 issues on an1034)
- 02:08 ottomata: stat1002: disabled puppet, umounted /tmp, /home and /a, poweroff
2017-07-26
- 21:01 mforns: Deployed refinery using scap, then deployed onto hdfs
- 18:57 mforns: Deployed refinery-source using jenkins
2017-07-25
- 15:24 elukey: restart cassandra loading after maintenance via hue
- 13:06 elukey: stop cassandra load bundle, restarting AQS for jvm updates
- 12:13 elukey: executed sudo apt-get remove openjdk-8-jre openjdk-8-jre-headless on druid nodes
2017-07-24
- 14:24 ottomata: restarted mysql-eventbus eventlogging consumer with new consumer group
2017-07-20
- 20:31 nuria_: restarting eventlogging on eventlog1001
- 20:30 nuria_: deploying eventlogging c1c2c39411ccd002ff8cea197bc535155213f5fb and restarting
- 18:18 ottomata: deleted instance deployment-eventlogging03 in favor of new instance deployment-eventlog02
- 17:14 ottomata: killed tranquility instances tranq-banners and tranq-netflow running on druid1003 in joal's screen sessions
2017-07-18
- 13:04 ottomata: adding unique index on meta_id and index on meta_dt to mediawiki_page_{create,delete,move,undelete}_1 on db1046 MySQL eventlogging master
2017-07-17
- 16:27 elukey: set innodb_flush_log_at_trx_commit on bohrium to 2 and sync_binlog=300 to reduce iowait - T164073
- 14:31 elukey: set innodb_flush_log_at_trx_commit on bohrium to 1 (default value)- T164073
2017-07-12
- 13:48 fdans: updated pageview whitelist with din.wikipedia
2017-07-11
- 05:24 elukey: drop _Edit_11448630_old from dbstore1002
2017-07-10
- 16:14 nuria_: deploying eventlogging 5e16da16e3f5ce287829390a76b9f5b0c7715ee5
2017-07-08
- 07:55 elukey: re-run wikidata-specialentitydata_metrics-wf-2017-7-7 in Hue (failed Spark job)
2017-07-06
- 10:37 elukey: taking mysqldump for Piwik and storing it on stat1002:/a/backup/bohrium/mysqldump_20170706.sql
2017-07-04
- 11:21 joal: Redeploying refinery with scap
- 11:10 joal: Restart unique_devices-per_project_family-monthly-coord after correction deployed
- 11:03 joal: Deploying refinery onto hdfs
- 10:57 joal: Deploying refinery with scap
2017-07-03
- 16:40 joal: Manually launch sqoop imports for enwiki revision, and wikidatawiki revision and logging tables, snapshot=2017-06
2017-07-01
- 21:33 joal: Restart cassandra bundle at beginning of the month
2017-06-29
- 11:39 joal: Update tables and archived data and kill/start jobs for unique-devices per project-family
- 11:34 joal: Kill and restart druid webrequest sampled oozie jobs after deploy
- 11:18 joal: Update tables and restart mediawiki_history oozie jobs after deploy
- 10:58 elukey: deploy refinery to HDFS
- 10:57 elukey: fixed archiva whitelist in the analytics VLAN (VM changed IP)
- 07:03 joal: Deploying refinery with scap (after yesterday's failure)
2017-06-28
- 18:17 joal: Deploying refinery with scap
- 16:25 joal: Building / Deploying refinery-source from jenkins to archiva (v0.0.48)
- 15:42 elukey: analytics1030 back to the worker nodes after maintenance
2017-06-27
- 16:26 milimetric: quarry Rebooted all the boxes in an attempt to fix performance problems
- 10:05 elukey: added https://wiki.apache.org/commons/VfsProblems to stat1004
- 07:14 joal: Rerun wikidata-articleplaceholder_metrics-wf-2017-6-26
2017-06-24
- 10:31 elukey: re-run webrequest-load-coord-misc's failed job in hue
2017-06-23
- 07:32 elukey: uploaded new pageview whitelist following https://wikitech.wikimedia.org/wiki/Analytics/Team/Oncall#Find_and_fix_pageview_whitelist_exceptions for kbp.wikipedia
2017-06-21
- 20:23 joal: Disable puppet agent and restart kafka with 48h retention in deployment-kafka01
- 13:59 elukey: eventlogging restarted after reboot
- 13:54 elukey: stop eventlogging and reboot eventlog1001
- 13:15 elukey: reboot analytics1003 for kernel update
- 11:08 elukey: stop camus on an1003
2017-06-20
- 19:24 ottomata: beginning to consume select eventbus event using eventlogging mysql consumer and inserting into eventlogging analytics mysql db
- 18:01 joal: Rerun webrequest-load-wf-text-2017-6-20-12 after oozie failure
- 16:23 joal: Restarted tranquility for banners and netflow on druid1003
- 16:18 joal: Rererun pageview-druid-hourly-wf-2017-6-20-14 (failed due to druid reboots)
- 16:04 elukey: re-run pageview-druid-hourly-wf-2017-6-20-14 (failed due to druid reboots)
- 14:46 elukey: re-run failed webrequest-load-text/upload jobs due to reboots
- 13:29 elukey: restart webrequest-load-coord-text and webrequest-load-coord-upload failed jobs due to reboots
- 13:14 elukey: re-run wikidata-wdqs_extract-wf-2017-6-20-11 (failed for connection issues, likely due to reboots)
- 11:54 joal: Deleting old unique_devices data (renamed to unique_devices_per_domain)
- 10:27 elukey: reboot kafka1012, analytics1028, aqs1004 for kernel upgrades (canary hosts)
- 08:51 elukey: manually added the user 'hdfs' to the 'hive' group to be able to run refinery-drop-webrequest-partitions
- 08:49 elukey: manually running /srv/deployment/analytics/refinery/bin/refinery-drop-webrequest-partitions on an1003 to free hdfs space
2017-06-19
- 12:10 elukey: disable BBU auto learn on all the hadoop workers
2017-06-13
- 10:10 elukey: merged big zookeeper refactoring https://gerrit.wikimedia.org/r/#/c/354449 - Druid's Hadoop client config now correctly points to conf1* and not druid1*
2017-06-12
- 17:21 joal: Last deploy of the day for uniques patch
- 13:26 joal: redeploying refinery after bug patch
- 11:32 joal: Change production last_access_uniques dataset to unique_devices/per_domain
- 11:11 joal: Deploy refinery onto HDFS
- 11:03 joal: Regular weekly deploy of refinery (mostly unique_devices patches)
- 10:54 joal: Refinery-source deployed to archiva
2017-06-08
- 16:41 nuria_: deploying refinery to cluster
- 13:44 elukey: AQS cluster in beta wiped and re-bootstrapped due to T167222
- 12:54 elukey: run megacli -LDSetProp ADRA -LALL -aALL on analytics1047 to set ReadAheadAdaptive on analytics[1042-1046,1048-1057].eqiad.wmnet - T166140
- 12:16 elukey: run megacli -LDSetProp ADRA -LALL -aALL on analytics1047 to set ReadAheadAdaptive - T166140
- 10:35 elukey: executed megacli -LDSetProp NoCachedBadBBU -LALL -aALL on analytics1049/45
- 10:28 elukey: executed megacli -LDSetProp NoCachedBadBBU -LALL -aALL on analytics1032 as test - T166140
- 07:25 elukey: kill maps webrequest load coordinator as temporary measure to avoid oozie spamming
- 07:21 elukey: suspended cache maps as temporary measure to avoid oozie spamming
2017-06-07
- 06:50 elukey: restarted mediacounts-archive-wf-2017-06-06 in Hue (Java OOMs)
2017-06-06
- 15:44 ottomata: restarting eventlogging mysql consumer to allow is_mediawiki events through is_not_bot filter
- 15:24 ottomata: restarting eventlogging processor to bring in is_mediawiki ua classification
2017-06-02
- 14:48 mforns: Restarted webrequest-load-bundle after deploy
- 08:41 joal: Restarted last_access_uniques-monthly-coord after bug correction and deploy
- 04:42 elukey: removed some old scap revs for the Analytics refinery on stat1002 to free space
2017-06-01
- 14:29 mforns: Deployed refinery using scap, then deployed onto hdfs
- 12:47 mforns: Deployed refinery-source v0.0.46 using jenkins
2017-05-29
- 09:45 joal: Restarted wikidata-articleplaceholder_metrics-wf-2017-5-27
2017-05-26
- 12:54 elukey: restarted master Hadoop daemons for jvm upgrade
- 12:39 elukey: re-added analytics1030 to the hadoop workers
2017-05-25
- 10:11 elukey: removed /usr/share/druid/extensions/druid-hdfs-storage-cdh/druid-hdfs-storage-0.10.0.jar from all druid nodes
- 07:23 joal: Restart pageview-druid-hourly-wf-2017-5-24-19
2017-05-24
- 21:09 ottomata: pausing all druid oozie coordinators until hadoop loading is fixed
- 15:27 joal: Resume pageview-druid-hourly-coord after druid upgrade
- 13:07 joal: Suspend pageview-druid-hourly-coord for druid upgrade
- 09:06 joal: Restart oozie mediawiki_history denormalize/metrics job after bug fixing deploy
- 09:04 joal: Restart oozie last_access_uniques daily/monthly job after bug fixing deploy
- 09:01 joal: Restart oozie restbase job after bug fixing deploy
- 08:52 joal: Deploy refinery to HDFS
- 08:48 joal: Deploying refine
2017-05-23
- 14:32 joal: Start 1-off oozie jobs adding underestimate and offset values in historical archived uniques datasets
- 13:50 joal: Restarted oozie last_access_uniques jobs (daily + monthly) after deploy
- 13:47 joal: Restarted oozie druid hourly pageview job after deploy
- 13:46 joal: Restarted oozie druid uniques job after deploy
- 12:56 joal: Deploying refinery to HDFS
- 12:10 joal: Start refinery deployment
2017-05-16
- 22:54 elukey: disabled puppet and hadoop daemons again on analytics1030 (still need hw maintenance but motherboard replaced)
- 22:54 elukey: analytics1040 back to the hadoop worker nodes after maintenance
2017-05-09
- 10:13 elukey: re-run manually 2017-05-08T18 for misc due to job errors (failed oozie id 0020276-170424154741156-oozie-oozi-W)
2017-05-05
- 13:08 elukey: restart Pivot on thorium after banner_activity_minutely_sanitization_test cleanup
- 12:02 elukey: removed /etc/cron.d/piwik-archive on bohrium, now puppet creates it for user www-data
2017-05-04
- 16:26 elukey: set daily cron archiver (rather than every hour) for Piwik on bohrium
- 10:09 joal: Rerun full druid loading for daily uniques - 0012911-170424154741156-oozie-oozi-C
- 09:17 joal: Deploy refinery onto hdfs
- 09:13 joal: Deploy refinery from naos :)
2017-05-03
- 08:32 elukey: added "adapter=MYSQLI" to config.ini to enable LOAD FILE capabilities on piwik (restarted apache2)
- 08:20 elukey: GRANT FILE on *.* to piwik@localhost executed on bohrium (https://piwik.org/faq/troubleshooting/#faq_194)
- 08:16 elukey: removed 2>&1 from the Piwik cron archive script
- 08:12 elukey: set Piwik archive cron on bohrium to run every 3600s (rather than 14400)
2017-05-02
- 12:57 elukey: set long_query_time=5 to mysql on bohrium
- 12:54 elukey: enabled mysql slow query log on bohrium (/var/log/mysql/slow-query.log)
- 09:53 joal: Restart mediawiki history jobs to pick up new snapshot format
2017-04-28
- 08:45 joal: Restart Workflow webrequest-load-wf-maps-2017-4-28-1
2017-04-27
- 16:40 joal: Manually push (again) pageview whitelist
- 16:37 joal: restart Workflow aqs-hourly-wf-2017-4-27-14 and Workflow pageview-hourly-wf-2017-4-27-14
- 12:06 joal: Manually push pageview whitelist to silence oozie alerts
- 09:47 elukey: re-enabled tracking in piwik after maintenance
- 09:44 elukey: disabled tracking in piwik to allow mysql upgrade
2017-04-26
- 18:51 elukey: resumed oozie the complainer on Hue
- 10:12 elukey: restarted webrequest-load-(text|upload|misc|maps) failed jobs (Hadoop workers maintenance)
- 09:53 elukey: restart mediacounts-load-wf-2017-4-26-7 (failed due to maintenance to the hadoop cluster)
- 09:51 elukey: restart aqs-hourly-wf-2017-4-26-8 (failed because an1036's hdfs daemon went down for maintenance)
2017-04-25
- 10:33 joal: restart failed mediacounts-archive-coord : Workflow mediacounts-archive-wf-2017-04-24
2017-04-24
- 15:54 elukey: re-enabled oozie bundles webrequest-load and transfer_to_es
- 13:22 elukey: disable camus cron on an1003
- 13:08 elukey: suspended transfer_to_es bundle
- 13:07 elukey: suspended webrequest-load-bundle
2017-04-21
- 13:51 elukey: set innodb_flush_log_at_trx_commit = 0 and sync_binlog = 300 on bohrium's mysql
- 10:35 elukey: restart pivot for nodejs security upgrades
2017-04-20
- 17:32 joal: Start daily uniques druid loading job (from 2015-12-17)
- 17:26 joal: Restart druid pageview loading [daily-monthly]
- 17:06 joal: Restart wikidata-specialentitydata_metrics-coord and wikidata-articleplaceholder_metrics-coord
- 16:59 joal: Restart mobile_apps-uniques-[daily|monthly]-coord
- 15:49 milimetric: deployed Refinery
- 07:41 elukey: Restart mediacounts-archive-wf-2017-04-19 in Hue (Java Heap space issue)
2017-04-12
- 09:53 elukey: stop Clickhouse on druid100[123]
2017-04-07
- 08:30 joal: Insert fake test data in aqs pagecounts endpoint to set monitoring back to non-alarm state
2017-04-05
- 16:03 elukey: restarted webrequest-load-wf-text-2017-4-5-14
- 15:13 elukey: removed /etc/cron.daily/blogreport from eventlog1001 (manual backup in /home/elukey)
- 13:24 ottomata: deployed slightly improved eventlogging_sync.sh script on db1047 and dbstore1002
2017-04-04
- 19:50 ottomata: beginning jessie reimage for analytics105[56]
- 18:13 ottomata: starting jessie upgrade of analytics105[34]
- 17:26 joal: Restart mediawiki-history-denormalize-wf-2017-04
- 16:17 joal: Restart webrequest-load-wf-text-2017-4-4-14
- 08:14 elukey: restarted webrequest-load-wf-text-2017-4-4-6
2017-04-03
- 18:24 nuria: starting replication back on Eventlogging 1002/1047/1046
- 18:15 ottomata: dropping EL tables with really old data
- 12:53 elukey: restart webrequest-load-wf-upload-2017-4-3-11
- 11:49 elukey: manual run of sudo -u stats /a/refinery-source/guard/run_all_guards.sh --rebuild-jar
- 11:38 joal: Restart corrected mediawiki-history oozie job
- 11:30 joal: Deploying refinery to HDFS
- 11:25 joal: Deploying refinery
- 10:35 joal: Deploying refinery-source to archiva
2017-04-01
- 07:26 joal: Kill old cassandra bundles and restart new one for project_v2 production code
2017-03-30
- 18:02 elukey: an1039 back up again after thermal paste applied
- 17:54 ottomata: stopping hadoop services on analytics1046 for jessie upgrade
2017-03-29
- 17:19 nuria: restarted EL on eventlog1001 with new changeset and tables renamed
- 17:13 nuria: deploying eventlogging latest: 28740773cea545215ea610c8c3e1a3ba36ef5a6a (UA changes)
2017-03-28
- 14:30 elukey: analytics1028 back serving traffic - T159632
2017-03-27
- 16:05 joal: Relaunch corrected denormalize oozie job
- 14:06 elukey: restart hadoop-yarn-nodemanager on analytics1044
- 13:03 elukey: fixed permissions (hdfs:hdfs -> root:root for /var/lib/hadoop/data)
- 11:37 joal: Start manual sqoop for failed wikis (dawiki, cebwiki, srwiki)
- 07:34 elukey: re-run mediacounts-load-wf-2017-3-24-14 from hue
2017-03-23
- 20:17 ottomata: moved all analytics cluster cron jobs (camus and other) from analytics1027 to analytics1003: T159527
- 20:14 ottomata: earlier today: upgraded from cdh5.2 to cdh5.10 on analytics1030, somehow we missed it! :o
- 11:30 joal: Restart webrequest-load-wf-maps-2017-3-23-10
2017-03-21
- 15:33 ottomata: restarting eventlogging client side processors with ImageMetrics blacklist change
- 09:55 joal: Reset hdfs folders and hive tables and partitions for productionisation of mediawiki history
- 09:52 joal: Restart webrequest-load bundle to pick up new pageview definition (2017-03-21T09:00Z)
- 05:57 joal: Restart cassandra-hourly-wf-local_group_default_T_pageviews_per_project-2017-3-20-23
- 05:54 joal: Deploying refinery
2017-03-20
- 17:47 joal: Deploy refinery-source to archiva
- 13:21 elukey: restarted pageview-hourly-wf-2017-3-20-11
- 07:14 elukey: restarted webrequest-load-wf-misc-2017-3-20-3
2017-03-18
- 19:04 joal: restart mediacounts-load-wf-2017-3-18-15 and mediacounts-load-wf-2017-3-18-16
- 12:39 joal: Restart webrequest-load-wf-maps-2017-3-18-11
- 08:13 elukey: restarted 18 Mar 2017 03:00:00 webrequest-load-maps
2017-03-17
- 14:24 elukey: analytics1044 back in the cluster
- 12:27 elukey: restarted webrequest-load-wf-text-2017-3-17-10
- 10:51 joal: restarted mediacounts-archive-wf-2017-03-16
2017-03-16
- 10:51 fdans: deploying aqs to production
2017-03-15
- 16:07 elukey: Wiped AQS Beta cassandra cluster
2017-03-14
- 18:43 nuria: rolling back prior eventlogging deployment, userAgent column is restricted to 191 chars, needs to be bigger or UAs are truncated
- 14:14 elukey: analytics1043 back in service
- 12:53 elukey: restarted webrequest-load-wf-upload-2017-3-14-11
- 12:53 elukey: restarted webrequest-load-wf-text-2017-3-14-11
- 06:53 elukey: re-run mediacounts-archive-wf-2017-03-13 from Hue (OOMs in the stderr)
2017-03-13
- 14:36 elukey: analytics1042 back among the Hadoop workers
- 08:53 elukey: set innodb_buffer_pool_size=2048 for mysql on bohrium (Piwik)
2017-03-12
- 22:24 elukey: restarted webrequest-load-text 12 Mar 2017 16:00:00 and 17:00:00
- 22:24 elukey: stopped yarn nodemanager on an1028
- 22:17 elukey: restarted webrequest-load Maps 12 Mar 2017 14:00:00
- 07:11 elukey: re-set SET GLOBAL max_connections=300 on bohrium's mysql (got lost after the restart)
2017-03-10
- 15:39 elukey: applied innodb_buffer_pool_size = 512M and restarted mysql on bohrium
- 10:54 elukey: executed set global innodb_flush_log_at_trx_commit=2; on bohrium as test
2017-03-09
- 14:17 joal: restart failed webrequest load [upload maps misc] 2017-03-09T09:00Z
- 11:04 elukey: an1041 yarn nodemanager back running
- 10:31 elukey: analytics1041 yarn nodemanager stopped, chowning to yarn:yarn all the perms in /var/lib/hadoop/data/X/yarn dirs
- 10:09 elukey: restarted yarn-nodemanager on analytics1040
- 09:52 elukey: restarted Mar 2017 02:00:00 webrequest-load-text (second time)
- 08:57 elukey: re-running webrequest-load-text failed jobs too via Hue
- 08:43 elukey: re-run via Hue the failed upload-load job
- 08:39 elukey: re-run all the failed misc webrequest-load oozie jobs (total of four)
- 08:28 elukey: re-run 186-09 Mar 2017 00:00:00 (webrequest-load-maps) on Hue
2017-03-07
- 15:20 joal: deploying aqs in prod
- 14:44 joal: Deploy AQS on beta
- 12:52 elukey: analytics1040 back in service
- 12:50 elukey: restarted webrequest-load-wf-text-2017-3-7-9 from Hue (oozie id: 0010151-170228165458841-oozie-oozi-W mapred that failed: job_1488294419903_24496)
2017-03-03
- 11:29 joal: Restart 3 oozie spark jobs
- 11:02 joal: Deploying refinery after having broken stat1002 :(
- 10:32 joal: deploying refinery
- 09:43 joal: Deploying refinery-source v0.0.42 using jenkins
2017-03-02
- 18:22 ottomata: deleting and recreating oozie share lib
- 18:15 joal: Restarting webrequest load for text 2017-03-02T15:00Z
- 14:27 joal: restart mediacounts job starting 2017-03-01T11:00Z
2017-03-01
- 14:41 joal: Deploying refinery onto hdfs (before restarting jobs)
- 14:38 joal: Restart all hdfs oozie jobs with 2048M launcher memory (using script)
- 10:16 joal: Kill and restart webrequest-load-maps coordinator checking for new oozie_loader_memory parameter (starting from 2017-02-28T18:00 - 2g launcher memory)
- 09:39 joal: Kill and restart webrequest-load-maps coordinator checking for new oozie_loader_memory parameter (starting from 2017-02-28T18:00)
- 07:17 elukey: restarted manually the browser-general-coord failed jobs
- 07:13 elukey: restarted manually the pageview-hourly-coord failed jobs
- 07:09 elukey: restarted manually the pageview-druid-monthly-coord (february job failed)
- 07:06 elukey: restarted manually via Hue UI the webrequest-load-coord-misc failed jobs
- 06:59 elukey: restarted manually via Hue UI the webrequest-load-coord-maps failed jobs
2017-02-28
- 18:03 joal: restart pageview oozie job for 2017-02-28T12:00
- 17:53 elukey: restarted via Hue Feb 2017 14:00:00 webrequest-load-coord-misc/maps
- 14:02 joal: Suspend mediawiki-load jobs as well (forgot about those)
- 13:31 joal: Suspend webrequest-load bundle for CDH upgrade
- 13:30 elukey: stopping camus as prep step for the CDH upgrade
2017-02-23
- 12:18 joal: Restart cassandra-coord-pageview-per-project-hourly 2017-02-23T07, 08, 09 to recover from cassandra issue - Worked !
- 11:19 joal: Restart cassandra-coord-pageview-per-project-hourly 2017-02-23T07 and 08 to recover from cassandra issue
2017-02-22
- 08:06 elukey: restart Hue on an1027 for openssl upgrades
2017-02-16
- 13:22 elukey: updated firewall rules for Analytics VLAN
2017-02-15
- 13:55 elukey: disabled apache mod_deflate on bohrium (piwik test)
- 09:01 elukey: restarted Piwik with bulk_requests_use_transaction=0 to try to fix the SQL deadlock issue (https://github.com/piwik/piwik/issues/6398#issuecomment-91093146)
2017-02-13
- 21:38 elukey: Restarted webrequest-load-coord-upload 19:00 - failed and Hue returning 500s
2017-02-11
- 00:13 joal: Restarted webrequest-load-wf-text-2017-2-10-20
2017-02-10
- 09:53 elukey: re-enabled oozie bundles after maintenance
- 09:51 elukey: restarted Hive-* and oozie on analytics1003
- 09:40 elukey: suspending oozie bundles to allow oozie/hive maintenance
2017-02-09
- 13:02 mforns: Restarted webrequest-load-bundle and pageview-hourly-coord
- 12:46 mforns: Deployed refinery using scap, then deployed onto hdfs
- 12:00 elukey: added Marcel as superuser in Hue
- 11:56 elukey: stopped webrequest-load-bundle from hue
- 11:06 mforns: Deployed refinery-source using jenkins
- 10:48 elukey: restarting druid daemons for Java upgrades
- 10:05 elukey: re-enabled oozie bundles after maintenance
- 10:04 elukey: performed master failover from an1001 to an1002 (and vice-versa) for java upgrades
- 10:04 elukey: restarted oozie, hive-server and metastore for java upgrades
- 09:49 elukey: suspended oozie bundles temporarily to allow graceful restarts
2017-02-08
- 18:05 ottomata: restarting pivot
- 17:52 ottomata: restarting pivot
- 15:35 elukey: restarted all the failed oozie cassandra load jobs
2017-02-07
- 20:24 joal: Resubmit cassandra-coord-pageview-per-project-hourly for 2017-02-07T18:00
- 14:36 elukey: restarted webrequest-load-wf-text-2017-2-7-13
2017-02-04
- 13:18 joal: Restarted mediacounts-archive job for day 2017-02-03 (had failed)
2017-02-02
- 12:07 joal: Restarted daily and monthly pageview druid loading jobs
- 12:03 joal: Deployed refinery to correct bug introduced in https://gerrit.wikimedia.org/r/#/c/335067/
- 10:13 joal: Killed-Restarted last access uniques monthly jobs to pick up new config - 0097552-161121120201437-oozie-oozi-C
2017-02-01
- 19:01 joal: Killed-Restarted Mobile apps Uniques monthly jobs to pick up new config - 0096638-161121120201437-oozie-oozi-C
- 18:47 joal: Deploy refinery for uniques monthly patches
- 17:27 joal: Restarting 2 webrequest-load text jobs that failed during NM restart (2016-02-01T11:00 and T13:00)
- 13:12 elukey: restarted pageview-druid-monthly-coord and pageview-druid-daily-coord oozie coordinators after deployment
- 12:17 elukey: deployed Refinery via scap and then executed the hdfs copies on stat1002
2017-01-31
- 16:11 elukey: started Cassandra nodetool cleanup for aqs1007-a
- 16:04 elukey: started Cassandra nodetool cleanup for aqs1004-b
- 08:31 elukey: started Cassandra nodetool cleanup for aqs1004-a
2017-01-26
- 19:20 joal: Restart webrequest-load-coord-text 2017-01-26T15:00 after cluster shake
- 19:18 elukey: restored an1001 as RM and HDFS master
2017-01-24
- 21:30 ottomata: restarted hadoop-mapreduce-historyserver on analytics1001. it died due to OOM
2017-01-22
- 13:27 joal: Rerun pageview-druid-daily-wf-2017-1-20 trying to see if it fixes automagically
2017-01-19
- 15:51 joal: Launched 0080172-161121120201437-oozie-oozi-B to recover from missing webrequest-load 2017-01-18 19:00 with a correct setup this time
- 15:39 joal: Launched 0080149-161121120201437-oozie-oozi-B to recover from missing webrequest-load 2017-01-18 19:00
2017-01-17
- 11:16 joal: Remove mediawiki-history-beta datasource from druid
- 09:51 elukey: restarted mediacounts-archive-wf-2017-01-16
2017-01-11
- 19:23 joal: Start mediawiki history reconstruction job on newly sqooped data
- 18:25 joal: Replace /wmf/data/raw/mediawiki/tables/ with newly sqooped data
2017-01-10
- 15:30 joal: Restart 0024519-160420145651441-oozie-oozi-C for day 2017-01-09 to see if it fails again
2017-01-06
- 20:35 joal: Launched 0063574-161121120201437-oozie-oozi-C to cover for upload-2017-01-06-[16-17]
- 19:04 elukey: started 0063446-161121120201437-oozie-oozi-C to re-run upload-2017-1-6-17