Analytics/Server Admin Log/Archive/2016
2016-12-22
- 15:28 elukey: changed firewall rules to allow only $ANALYTICS_NETWORKS (rather than the broader $INTERNAL) for the Yarn UI http service (an1001) and the hive metastore (an1003)
2016-12-19
- 21:27 nuria: deployed analytics refinery, restarted webrequest load and pageview_hourly jobs
- 20:11 nuria: deployed analytics/refinery to cluster (2nd try)
2016-12-13
- 11:12 elukey: deleted /srv/stat1001 on stat1004
2016-12-09
- 14:32 joal: restarted eventlogging mysql consumer after DB restart
- 13:57 joal: Stopped EventLogging Mysql consumer for database restart
2016-12-08
- 18:37 ottomata: preferred-replica-election on analytics kafka cluster to bring 1012 back as leader for its partitions
- 18:15 ottomata: restarting broker on kafka1012 to repro T152674
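A preferred-replica election like the one logged above is standard Kafka tooling; a minimal sketch, assuming the stock CLI script and a placeholder ZooKeeper address:
 # Re-elect the preferred leader for every partition, so a broker that
 # rejoined the ISR (e.g. kafka1012) takes leadership back.
 kafka-preferred-replica-election.sh --zookeeper zookeeper.example:2181/kafka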
2016-12-07
- 21:59 ottomata: restarting eventlogging again to pick up puppet changes to use kafka-confluent writer
- 19:39 ottomata: restarting analytics eventlogging to test out confluent kafka producer for processors
2016-12-05
- 11:02 joal: Killing wikidata-articleplaceholder_metrics job and restarting it starting Nov. 1st for code update
- 10:43 joal: Deploy refinery onto hdfs
- 10:35 joal: deploying refinery
2016-12-02
- 09:43 joal: Restarted yesterday's failed oozie webrequest-load jobs (upload, text, misc; hours 21, 22, 23)
2016-12-01
- 20:27 ottomata: bouncing kafka broker on kafka1018 to test config changes to eventlogging analytics kafka clients
- 20:25 ottomata: restarting eventlogging analytics processes again to pick up api_version change for consumers too
- 19:45 ottomata: restarting eventlogging analytics processes to pick up api_version kafka arg
- 08:02 elukey: added fi.wikivoyage to the pageview whitelist manually
2016-11-30
- 21:32 milimetric: restarted webrequest/load oozie bundle
- 21:17 milimetric: Deployed refinery using scap, then deployed onto hdfs
- 20:52 milimetric: Deployed refinery-source using jenkins
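The two-step deploy logged at 21:17 (scap to the hosts, then onto HDFS) roughly corresponds to the sketch below; the helper script name and its flags are assumptions based on the refinery repo layout, not confirmed by the log:
 # Step 1: ship analytics/refinery to the deployment targets with scap3.
 cd /srv/deployment/analytics/refinery && scap deploy
 # Step 2: push the deployed artifacts into HDFS so oozie jobs can use them.
 sudo -u hdfs /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs --verbose --no-dry-run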
2016-11-25
- 09:16 elukey: resumed oozie bundles and camus crontab after maintenance
- 08:49 elukey: stopping oozie and camus as prep-step for Yarn/HDFS master failover (remaining hosts with old openjdk)
2016-11-21
- 20:23 nuria: restarted webrequest jobs 0000454-161121120201437-oozie-oozi-B
- 19:09 nuria: deploying improvements to oozie job alarms to analytics cluster https://gerrit.wikimedia.org/r/#/c/319582/
- 18:45 joal: Launch 0000336-161121120201437-oozie-oozi-C to cover for webrequest-load-text-2016-11-21-13
- 17:28 elukey: unmasked kafka* on kafka1022 after disk swap
- 12:07 elukey: restarted oozie bundles via hue after oozie/hive restart
- 11:45 elukey: stopped all the oozie bundles via Hue as prep-step for Hive/Oozie daemons restarts on an1003
- 10:15 elukey: created 0039892-161020124223818-oozie-oozi-C and 0039895-161020124223818-oozie-oozi-C for webrequest-load-wf-(text|upload)-2016-11-21-8
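The one-off coordinators created above backfill single missed hours; a hedged sketch of submitting such a coordinator with the oozie CLI, where the server URL, properties path, and property names are illustrative assumptions:
 # Submit a webrequest-load coordinator restricted to the missing hour.
 oozie job -oozie http://analytics1003.eqiad.wmnet:11000/oozie \
   -config /srv/deployment/analytics/refinery/oozie/webrequest/load/coordinator.properties \
   -Dstart_time=2016-11-21T08:00Z -Dstop_time=2016-11-21T09:00Z \
   -submit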
2016-11-17
- 00:03 joal: Launched 0034429-161020124223818-oozie-oozi-C to cover for wf-text-2016-11-16-21
2016-11-12
- 19:23 joal: Launch 0028421-161020124223818-oozie-oozi-B to cover for webrequest-load hours 19-20 missing on 2016-11-10
2016-11-10
- 19:59 nuria: deployed v0.0.37 of refinery to hdfs
- 18:22 nuria: deployed v0.0.37 of refinery-source https://gerrit.wikimedia.org/r/#/c/320797/
2016-11-08
- 12:33 joal: Deploying refinery for patching pageview whitelist
2016-11-07
- 09:45 elukey: started 0022558-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-7-07
- 08:00 elukey: started 0022441-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-7-04 -> 06
- 04:53 joal: started 0022249-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-7-00 -> 03
2016-11-06
- 19:50 joal: started 0021806-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-6-16
- 17:39 elukey: started 0021694-161020124223818-oozie-oozi-C to rerun wf-text-2016-11-6-15
- 09:27 joal: started 0021136-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-6-01 -> 07
2016-11-05
- 18:05 joal: started 0020254-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-5-10
- 08:47 joal: started 0019693-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-5-00 -> wf-upload-2016-11-5-07
- 08:45 joal: started 0019686-161020124223818-oozie-oozi-C to re-run wf-text-2016-11-4-19 -> wf-upload-2016-11-4-20
2016-11-04
- 08:45 elukey: started 0018557-161020124223818-oozie-oozi-C to re-run wf-upload-2016-11-4-6
- 08:45 elukey: started 0018549-161020124223818-oozie-oozi-C to re-run wf-upload-2016-11-4-2 -> wf-upload-2016-11-4-4
2016-11-02
- 19:43 ottomata: manually stopped an old wikistats_git pageviews cron in spetrea's crontab on stat1003. no output from it since 2013, and spetrea doesn't really have an account
2016-11-01
- 17:52 joal: Deploying refinery
- 14:45 joal: Restart webrequest load job to apply the deployed changes
- 14:33 joal: deploying refinery onto the cluster
- 14:00 ottomata: restarting pivot
2016-10-31
- 17:09 ottomata: bouncing eventlogging
- 17:00 ottomata: kafka preferred replica election on main-eqiad kafka cluster to promote kafka1003 as leader for its preferred partitions
- 14:49 ottomata: adding kafka1003 in as replicas for active main-eqiad topics
- 14:12 ottomata: adding kafka1003 as kafka broker in main-eqiad cluster
- 14:00 joal: deploy refinery
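Adding kafka1003 as a replica for existing topics (the 14:49 entry) is done with a partition reassignment; a minimal sketch, where the topic name, partition number, broker ids, and ZooKeeper address are purely illustrative:
 # Describe the desired replica sets in a JSON plan, then hand it to Kafka.
 cat > /tmp/reassign.json <<'EOF'
 {"version":1,"partitions":[
   {"topic":"eqiad.mediawiki.job","partition":0,"replicas":[1001,1002,1003]}
 ]}
 EOF
 kafka-reassign-partitions.sh --zookeeper zookeeper.example:2181/kafka \
   --reassignment-json-file /tmp/reassign.json --execute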
2016-10-28
- 13:04 elukey: oozie firewall rules changed - now only the analytics network is allowed
- 00:19 bd808: Testing logging to mw.o SAL via stashbot
2016-09-23
- 09:06 elukey: reboot eventlog2001.codfw.wmnet for kernel upgrades
- 08:45 elukey: upgrading varnishkafka to 1.0.12-1 in cache:misc
- 08:32 elukey: upgrading varnishkafka to 1.0.12-1 in cache:maps
2016-09-22
- 15:30 elukey: analytics1001 is back as Yarn/HDFS master
- 13:16 elukey: previous comment was meant to be read as "set a permanent read only = false"
- 13:16 elukey: set read_only = false (on startup) for the analytics1003's mariadb instance
- 13:12 elukey: restarted oozie jobs for 2016-9-22-6
- 12:50 elukey: varnishkafka 1.0.12 installed in cache:upload ulsfo and eqiad
- 11:04 elukey: re-enabling oozie and camus after cluster reboots
- 10:57 elukey: rebooted analytics1001
- 10:55 elukey: Failover from analytics1001 to analytics1002 as prep step for 1001's reboot (see the failover sketch after this day's entries)
- 10:28 elukey: setting global read_only = 0 to analytics1003 mariadb instance
- 10:04 elukey: rebooted analytics1003 (oozie, hive-metastore and hive-server2 daemons affected)
- 09:51 elukey: executed aptitude remove apache2 on analytics1027 (we use nginx in front of hue; apache steals port 8888 from hue, so hue does not start)
- 09:49 elukey: suspended all oozie bundles as prep step to reboot analytics1003
- 09:39 elukey: rebooted analytics1027
- 09:14 elukey: varnishkafka 1.0.12 installed in cache:upload codfw
- 08:52 elukey: varnishkafka 1.0.12 installed in cache:upload esams
- 06:45 elukey: stopped camus on analytics1027 and suspended webrequest-load-bundle via Hue (prep step for reboots)
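A hedged sketch of the 10:55 master failover above; the NameNode service ids are assumptions (they come from dfs.ha.namenodes.* in hdfs-site.xml), and the Yarn ResourceManager fails over analogously via yarn rmadmin:
 # Hand the active NameNode role from analytics1001 to analytics1002.
 sudo -u hdfs hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet
 # Confirm the new active/standby states.
 sudo -u hdfs hdfs haadmin -getServiceState analytics1002-eqiad-wmnet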
2016-09-21
- 17:43 elukey: installed varnishkafka 1.0.12-1 on cp3034.esams
- 06:25 elukey: removed aqs100[123] from live traffic
2016-09-20
- 17:03 elukey: aqs100[56] added to LVS and serving live traffic
- 16:22 elukey: restarting cassandra on aqs1005
- 07:41 elukey: restart cassandra on aqs100[456] for T130861 - only aqs1004 is taking live traffic
2016-09-16
- 09:24 elukey: added aqs100[456] to conftool-data (not pooled but the load balancer is doing health checks)
2016-09-14
- 16:07 elukey: cassandra on aqs100[123] restarted for T130861
2016-09-12
- 18:54 ottomata: reenabled camus with new version of camus checker jar
- 18:41 ottomata: disabled camus crons on analytics1027
- 09:48 elukey: restarted pivot on a tmux session on stat1002 since it died
2016-09-09
- 08:32 elukey: executed apt-get clean on analytics1032 to free space
2016-09-08
- 15:37 ottomata: deploying refinery with v0.0.35 of refinery source
- 09:54 elukey: removed duplicates from the hdfs crontab on analytics1027
2016-09-05
- 13:23 elukey: removed the unused analytics-root group from puppet
2016-08-31
- 09:18 elukey: deleted /var/www/limn-public-data/caching on stat1001 to free space
- 09:10 elukey: Moved stat1003:/srv/reportupdater/output/caching to /home/elukey/caching as temporary measure to free space on stat1001
- 07:54 elukey: removed /home/home dir from stat1001 to free space
- 07:52 elukey: removed /home/home/home dir from stat1001 to free space
2016-08-30
- 17:45 joal: Drop pageviews test datasource in druid
2016-08-26
- 13:52 elukey: re-enabling camus and oozie
- 13:48 elukey: restarted hadoop-hdfs-namenode on analytics1002 (1001 back to active)
- 13:45 elukey: restarted yarn-resourcemanager on analytics1002 (1001 back to active)
- 13:33 elukey: restarted hadoop-hdfs-namenode on analytics1001
- 13:30 elukey: restarted yarn-resourcemanager on analytics1001
- 13:09 elukey: oozie, hive-server and hive-metastore restarted for security upgrades
- 11:32 elukey: stopped camus on analytics1027
- 11:31 elukey: suspended all the oozie bundles via Hue
2016-08-12
- 14:40 elukey: created the 'aqsloader' user on aqs100[456] cassandra instances following https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/AQS_Tasks
- 14:09 joal: Deploy refinery on hadoop
- 13:51 joal: Deploy refinery from tin
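Creating the 'aqsloader' user (the 14:40 entry) amounts to a CQL statement like the following sketch; the multi-instance hostname is an assumption and the passwords are elided:
 # Create the loader account used to write AQS data (Cassandra 2.1 syntax).
 cqlsh -u cassandra -p '...' aqs1004-a.eqiad.wmnet \
   -e "CREATE USER aqsloader WITH PASSWORD '...' NOSUPERUSER;"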
2016-08-10
- 15:41 joal: Loading 2016-07 in new aqs
2016-08-09
- 17:48 ottomata: restarting eventlogging with kafka-python 1.3.1 (and bugfix), will be testing kafka broker restarts again today
- 13:12 elukey: deploying the aqs cassandra user to aqs100[123] (not using it in aqs-restbase yet)
- 13:10 elukey: deploying the aqs cassandra user to aqs100[456] (not using it in aqs-restbase yet)
2016-08-08
- 18:54 ottomata: restarting eventlogging with processors retries=6&retry_backoff_ms=200. if this works better, will puppetize.
- 18:30 ottomata: restarting kafka broker on kafka1013 to test eventlogging leader rebalances
- 15:13 ottomata: deploying eventlogging/analytics - kafka-python 1.3.0 for both consumers and producers
- 14:13 joal: Loading 2016-06 in clean new aqs
- 14:10 joal: Adding test data onto newly wiped aqs cluster
- 14:06 joal: Updating cassandra compaction to deflate on newly wiped cluster
2016-08-05
- 15:39 joal: Restart oozie jobs for druid loading from production refinery instead of joal
- 14:31 joal: Retrying deploying refinery from scap
- 13:51 joal: Stopping pagecounts-[raw|all-sites] oozie jobs (load and archive)
- 13:07 joal: Deploying refinery using scap
- 12:59 joal: Rolled back refinery interactive deploy
- 12:54 joal: Deploy refinery using brand new scap deploy!
- 07:42 elukey: ran apt-get clean on analytics1027 to free space
2016-08-04
- 19:50 ottomata: now running kafka-python 1.2.5 for eventlogging-service-eventbus in codfw, removed downtime for kafka200[12]
- 17:36 elukey: added the analytics-deploy key to the Keyholder for the Analytics Refinery scap3 migration (also updated https://wikitech.wikimedia.org/wiki/Keyholder)
- 17:29 elukey: deploying the refinery with scap3 for the first time on all nodes
2016-07-29
- 01:55 milimetric: limn1 disk full, no idea how to clean it because /public refuses to list its files or listen to me when I try to delete it
2016-07-28
- 17:37 ottomata: powercycling analytics1032
2016-07-26
- 10:13 joal: Re-deploying refinery after bug fix
- 09:26 joal: Deploying refinery
- 08:41 joal: Deploying refinery-source using Jenkins
2016-07-25
- 18:31 ottomata: upgrading kafka to 0.9 in main-codfw, first kafka2001 then 2002
2016-07-20
- 19:40 joal: Relaunch 2016-07-19 cassandra per-article-daily oozie job
- 15:45 elukey: executed https://phabricator.wikimedia.org/P3520 on aqs100[456] for both a/b cassandra instances
- 15:33 elukey: raising compaction throughput to 256 on aqs100[456]
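Raising the compaction throughput cap (the 15:33 entry) is a live per-node nodetool setting, in MB/s; a minimal sketch:
 # Raise the compaction throughput cap to 256 MB/s on this node...
 nodetool setcompactionthroughput 256
 # ...and read it back to confirm.
 nodetool getcompactionthroughput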
2016-07-18
- 17:16 joal: Change compression from lz4 to deflate on aqs100[456] (sketch after this day's entries)
- 08:59 joal: deploy restbase on aqs100[23]
- 08:36 elukey: re-executed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2016-7-16 (failed oozie job)
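The lz4-to-deflate change above is an ALTER TABLE on the affected keyspaces; a sketch using Cassandra 2.1 option names, with the keyspace/table assumed from the AQS schema named elsewhere in this log:
 # Switch SSTable compression to Deflate (better ratio, more CPU).
 cqlsh aqs1004.eqiad.wmnet -e "ALTER TABLE \"local_group_default_T_pageviews_per_article_flat\".data
   WITH compression = {'sstable_compression': 'DeflateCompressor'};"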
2016-06-08
- 08:45 elukey: removed temporary retention override for kafka webrequest_text topic (T136690)
- 08:17 elukey: lowering webrequest_text kafka topic retention time from 7 days to 4 days to free disk space
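Per-topic retention overrides like the ones above can be set and later removed with Kafka's config tool; a sketch assuming Kafka 0.9 tooling and a placeholder ZooKeeper address:
 # Temporarily keep only 4 days (345600000 ms) of webrequest_text...
 kafka-configs.sh --zookeeper zookeeper.example:2181/kafka --alter \
   --entity-type topics --entity-name webrequest_text --add-config retention.ms=345600000
 # ...and drop the override once disk pressure is gone.
 kafka-configs.sh --zookeeper zookeeper.example:2181/kafka --alter \
   --entity-type topics --entity-name webrequest_text --delete-config retention.ms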
2016-06-07
- 17:51 ottomata: restarting broker on kafka1020
- 10:10 elukey: hue restarted on analytics1027 for security upgrades
2016-06-06
- 19:16 ottomata: restarting kafka broker on kafka1020 to test python consumption client
2016-06-04
- 09:47 elukey: removed temporary Analytics Kafka upload retention override (T136690)
- 09:38 elukey: Temporarily lowering the Analytics Kafka upload retention time to 24h to free space (T136690)
2016-06-03
- 08:38 elukey: event logging restarted on eventlog1001
- 08:34 elukey: rebooting kafka1012 for kernel upgrades.
2016-06-02
- 19:53 ottomata: stopping kafka broker and restarting kafka1014
2016-06-01
- 18:16 ottomata: stopping kafka broker on kafka1018 and rebooting node
- 11:55 elukey: restarted EL on eventlog1001
- 11:51 elukey: rebooting kafka1022 for kernel upgrades
- 08:26 elukey: deleted very old kafka.log files in /var/log/kafka to free root space
- 07:54 elukey: EL restarted on eventlog1001
- 07:47 elukey: stopping kafka on kafka1020.eqiad and rebooting the host for Linux 4.4 upgrades
2016-05-27
- 11:28 elukey: restarted jmxtrans on kafka10* hosts
- 11:26 elukey: restarted jmxtrans on kafka1013
- 11:21 elukey: executed kafka preferred-replica-election on kafka1013
2016-05-25
- 14:24 joal: deploying aqs from tin
- 14:16 joal: Deploying aqs into aqs_deploy
2016-05-24
- 19:25 nuria_: deploying latest master to dashiki 08cc9a2545bcc0a183a3c00c18e81f21326a41b
- 12:56 elukey: EL restarted after kafka1013 node stop (kernel upgrades)
- 12:50 elukey: stopping kafka on kafka1013 and rebooting the host for kernel upgrade
2016-05-23
- 17:28 elukey: re-run from Hue webrequest-load-wf-(text|upload)-2016-5-23-13. The failures were likely caused by my global Yarn restart on the cluster.
- 17:20 elukey: oozie bundles re-enabled
- 14:58 elukey: suspended all the oozie bundles as prep step for https://gerrit.wikimedia.org/r/#/c/290252 (yes I know super paranoid mode on)
- 06:42 elukey: Removed Kafka temp. override for webrequest_upload retention.ms after freeing some disk space.
- 06:37 elukey: Set kafka retention.ms=172800000 for the topic webrequest_upload to free some disk space on kafka1022
2016-05-20
- 12:50 elukey: aqs100[123] restarted for openjdk upgrades
- 08:53 elukey: cassandra upgraded to 2.1.13 on aqs1003
- 08:30 elukey: aqs1002 migrated to cassandra 2.1.13
2016-05-02
- 18:30 joal: manually touch _SUCCESS file in hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2016/05/02/14/ to launch refine process despite load job failure (sketch after this day's entries)
- 17:38 elukey: removed out of service banner from dashiki dashboards
- 17:33 elukey: reverted Varnish config to return 503s for datasets and stats
- 12:14 elukey: deployed Varnish change to force HTTP 503 for datasets.wikimedia.org, stats.wikimedia.org, metrics.wikimedia.org as prep-step for OS reimage.
- 12:05 elukey: enabled maintenance banner to dashiki based dashboards via https://meta.wikimedia.org/wiki/Dashiki:OutOfService
- 11:21 elukey: deployed last version of Event Logging. Service also restarted.
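Manually flagging an hour as complete (the 18:30 entry) just creates the _SUCCESS marker that oozie's dataset dependency checks for; a minimal sketch:
 # Create the empty _SUCCESS flag so downstream refine jobs can start.
 sudo -u hdfs hdfs dfs -touchz \
   /wmf/data/raw/webrequest/webrequest_text/hourly/2016/05/02/14/_SUCCESS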
2016-04-30
- 13:42 elukey: disabled puppet on analytics1047 and scheduled downtime for the host due to IO errors in dmesg for /dev/sdd. Also stopped Hadoop daemons to remove it from the cluster temporarily (not sure how to do it properly, will write docs).
2016-04-28
- 10:44 joal: deployed aqs on all three nodes (Thanks elukey!!!!)
- 09:03 joal: Deploying aqs on aqs1001
- 08:14 elukey: restarting kafka on kafka{1012,1014,1022,1020,2001,2002} for Java upgrades. EL will be restarted as well (sigh)
2016-04-27
- 15:47 elukey: restarted event logging on eventlogging1001
- 14:01 elukey: restarted Event Logging on eventlogging1001
- 13:53 elukey: restarted kafka on kafka1018.eqiad.wmnet for Java upgrades
2016-04-25
- 19:55 nuria_: deployed new vitalsigns code to https://vital-signs.wmflabs.org
- 17:43 nuria_: deployed new vitalsigns code to https://vital-signs.wmflabs.org
2016-04-22
- 09:23 moritzm: installing ircbalance bugfix updates (preventing massive logspam on some systems)
2016-04-20
- 16:06 elukey: camus re-enabled on analytics1027
- 13:54 elukey: puppet stopped on analytics1027 together with Camus (via crontab -e)
- 10:41 elukey: started rsync of /srv from stat1001 to stat1004 (/srv/stat1001)
2016-04-19
- 08:33 joal: deployed new refinery on hadoop
- 08:21 joal: deploying refinery from tin
2016-04-18
- 10:11 elukey: execute sudo eventloggingctl restart on eventlogging1001
2016-04-13
- 16:35 ottomata: rebuilding raid1 array on aqs1001 after hot swapping sdh
- 15:00 joal: restarting failed jobs
- 14:38 ottomata: restarting hadoop-yarn-nodemanager on all hadoop worker nodes one by one to apply increase in heap size
2016-04-11
- 11:52 joal: Restart refine job after deploy
- 10:30 joal: Deploying refinery on HDFS
- 10:21 joal: deploying refinery from tin
- 09:13 joal: Releasing refinery-source v0.0.30 to archiva
2016-04-08
- 10:09 joal: deploying aqs from tin on aqs1003
- 10:08 joal: deploying aqs from tin on aqs1002
- 10:03 joal: deploying aqs from tin on aqs1001
2016-04-07
- 22:58 nuria_: deployed browser-reports master branch to labs
- 19:34 ottomata: restarting eventlogging so it runs out of the scap deploy in eventlogging/analytics
- 10:21 elukey: nodejs-legacy upgraded too on all aqs nodes
- 09:43 elukey: aqs1002.eqiad.wmnet re-pooled, aqs1003.eqiad.wmnet de-pooled/re-pooled too (nodejs upgrade)
- 09:30 elukey: aqs1002.eqiad.wmnet de-pooled via confctl. Nodejs upgrade will follow.
- 09:18 elukey: re-added aqs1001.eqiad.wmnet to LVS pool via confctl
- 08:59 elukey: removed aqs1001.eqiad.wmnet from LVS pool via confd for nodejs upgrade
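The depool/repool dance for the nodejs upgrades above is driven by conftool; a hedged sketch of the confctl invocations, with the selector syntax assumed:
 # Take aqs1001 out of the LVS pool before the upgrade...
 sudo confctl select 'name=aqs1001.eqiad.wmnet' set/pooled=no
 # ...and put it back once nodejs is upgraded and the service checks out.
 sudo confctl select 'name=aqs1001.eqiad.wmnet' set/pooled=yes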
2016-04-06
- 14:04 elukey: ran nodetool repair system_auth on aqs1002.eqiad/aqs1003.eqiad
- 13:59 elukey: ran nodetool repair system_auth on aqs1001.eqiad
- 11:45 elukey: started nodetool repair on aqs1002 after running "ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };"
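The 11:45 entry quotes the CQL already; the full sequence, replication bump then repair so every node holds the auth data, looks roughly like:
 # Replicate system_auth to all three nodes, then repair to propagate it.
 cqlsh aqs1002.eqiad.wmnet -e "ALTER KEYSPACE system_auth
   WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };"
 nodetool repair system_auth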
2016-04-04
- 15:45 elukey: aqs1001 re-added to the aqs pool (nodejs NOT upgraded)
- 14:46 elukey: de-pooled aqs1001.eqiad from the confd pool for nodejs upgrade
- 10:42 elukey: re-pooled aqs1001.eqiad (no node upgrade, need more info about restbase)
- 09:53 elukey: de-pooled aqs1001.eqiad.wmnet as pre-step for nodejs upgrade
2016-04-01
- 13:38 joal: Deploying aqs in aqs1003 from tin
- 13:35 joal: Deploying aqs in aqs1002 from tin
- 13:23 joal: Deploying aqs in aqs1001 from tin
2016-03-31
- 20:01 ottomata: stopping eventlogging, uninstalling globally installed eventlogging python code, running puppet, restarting eventlogging from /srv/deployment/eventlogging/eventlogging
- 19:45 ottomata: merging puppet change to run eventlogging code out of deploy repo
2016-03-30
- 18:06 ottomata: repooling aqs1001
- 18:00 ottomata: depooling aqs1001
2016-03-29
- 13:27 joal: Update CirrusSearchRequestSet schema in hive
2016-03-24
- 18:29 elukey: camus and puppet re-enabled on analytics1027
- 18:27 ottomata: resuming suspended webrequest load and refine jobs
- 17:57 elukey: enabled Hadoop Master Node automatic failover on analytics1001/1002 (this time without fireworks).
- 17:09 ottomata: temporarily suspending oozie webrequest refine jobs
- 16:18 ottomata: suspending webrequest load job temporarily
- 16:15 elukey: disabled camus and puppet on analytics1027
- 13:16 elukey: camus and puppet re-enabled on analytics1027
- 09:56 elukey: Camus stopped on analytics1027 (puppet disabled too)
- 09:52 elukey: puppet disabled on analytics1001/1002 as prep-step to enable HDFS HA failover.
2016-01-21
- 16:35 ottomata: stopped eventlogging mysql consumers for long downtime: https://phabricator.wikimedia.org/T120187
- 16:20 ottomata: started eventlogging mysql consumers
- 15:59 ottomata: stopping eventlogging mysql consumers for https://phabricator.wikimedia.org/T123546
2016-01-20
- 18:30 mforns: deployed EL in production with removal of queue
- 17:37 mforns: restarted EventLogging because of Kafka consumption lag
2016-01-19
- 20:08 mforns: deployed eventlogging to deployment-eventlogging03 with removal of mysql consumer batch
2016-01-18
- 14:49 ottomata: restarting eventlogging to un-blacklist MobileWebSectionUsage
- 01:07 ottomata: restarted eventlogging again. A single raw client side processor consumer seemed stuck (according to burrow). seeing offset commit errors in logs.
2016-01-17
- 08:26 ottomata: restarting eventlogging to see if it'll help burrow reported kafka consumer lag
2016-01-14
- 22:29 YuviPanda: wikimetrics <whatever>
- 19:55 ottomata: restarted eventlogging_sync script to insert batches of 1000
2016-01-13
- 20:01 ottomata: dropped MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 from analytics-store eventlogging slave db
- 19:24 ottomata: restarting eventlogging to apply blacklist of MobileWebSectionUsage schemas