The DevOps Sprint 2013 was a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.

Sprint focus areas

Monitoring

Primary Goals:

We shouldn't find out that parts of our infrastructure are down only because a browser test failed (those tests run just twice a day).
Inform deployment rollback decisions with pre- and post-deployment performance metrics

Done Ops/Bryan: procure servers - rt ticket
Done Ops/Greg: get Aaron and Bryan root on those servers - rt ticket
Write more filter rules to parse various message formats
Get a syslog feed from beta
Done Package logstash jar for deployment
- ~~Ask Andrew Otto for advice based on Analytics packaging experience for Java projects~~
Done Make a puppet module
- Note: matanya has said "i'm writing the logstash module" on IRC which may cover this and the packaging question
- We would probably still need to provide config for our usage
Done Determine architecture for production deployment
- What log forwarding methods should be used?
  - Udp2log is being used for now. Additional inputs will be added as needed
- How many Logstash instances?
  - Starting with a 3 node cluster. Udp2log is only pointed at a single instance currently (logstash1001).
- HA strategy?
  - Native Elasticsearch clustering. Kibana can be served from any of the 3 nodes. Udp2log input is currently a single point of failure.
- QoS terms?
  - No QoS/ToS requirements have been established yet.
Done tee the logs from MW to logstash (in addition to current fluorine logs)
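
The "tee" step above can be pictured with a minimal Python sketch. It is not the actual MediaWiki or udp2log code: it only shows one log datagram being sent to several UDP sinks, so that a new consumer such as Logstash receives a copy without disturbing the existing one. The function name `tee_datagram` and the host names in the usage note are hypothetical.

```python
import socket


def tee_datagram(payload, destinations, sock=None):
    """Send the same UDP payload to every (host, port) pair.

    Returns the number of destinations the payload was handed to.
    Hypothetical helper for illustration; not part of udp2log or MediaWiki.
    """
    own_socket = sock is None
    if own_socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    try:
        for host, port in destinations:
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # One unreachable sink should not stop delivery to the others.
                continue
    finally:
        if own_socket:
            sock.close()
    return sent
```

A call could look like `tee_datagram(b"channel message\n", [("existing-sink.example", 8420), ("logstash.example", 8324)])`, where both hosts and ports are placeholders. Because UDP is fire-and-forget, the return value only says that the datagram was handed to the network stack, not that a sink received it.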

structured logging RFC
- Design proposed PHP API changes
- Clean up the proposal by removing or archiving other examples
- Done Submit structured logging RFC
Ori: Record sync/scap elapsed wall clock times in graphite
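
As a rough illustration of recording elapsed wall clock time in Graphite, the Python sketch below times a block of work and formats the result for Carbon's plaintext protocol (`<path> <value> <timestamp>` followed by a newline, conventionally on port 2003). It is not scap's implementation: the metric path, the host name, and the helper names `format_metric`, `timed`, and `send_to_carbon` are all assumptions made for this example.

```python
import socket
import time
from contextlib import contextmanager


def format_metric(path, value, timestamp):
    """Build one line of Carbon's plaintext protocol."""
    return f"{path} {value:.3f} {timestamp}\n".encode("ascii")


@contextmanager
def timed(path, send, clock=time.monotonic, now=time.time):
    """Time the enclosed block and pass the formatted metric to `send`.

    `clock` measures the duration; `now` supplies the epoch timestamp.
    Both can be replaced to make the helper testable.
    """
    start = clock()
    try:
        yield
    finally:
        send(format_metric(path, clock() - start, int(now())))


def send_to_carbon(line, host="graphite.example.org", port=2003):
    """Deliver one metric line over TCP. The default host is a placeholder."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line)
```

Wrapping a deployment step would then read `with timed("deploy.scap.elapsed", send_to_carbon): run_sync()`, where `run_sync` stands in for whatever the step actually does. The metric is emitted in a `finally` clause, so a failed run is still recorded.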

On-wiki documentation of fatal and exception logging on the cluster - bug 52026
Make l10nupdate emit useful log messages
- report to SAL with where the log lives
Get Icinga to alert on important metrics
- Determine if icinga has the graphite plugin
- Have graphite use the prediction algorithm plugin for alertable metrics
- Figure out where platform alerts should go (mw-core initially?)
- Turn stuff on (for a set of metrics)
- Review of current metrics for alertable ones
- Find new relevant monitoring/alerting metrics
Instrumentation in scap to relay more information out of the deployer's terminal
- dashboard grid of servers being updated, color indicating status of individual server's code version
Log/show exceptions per file/extension
- simple text file initially
Logstash in production
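
To make the "Get Icinga to alert on important metrics" item more concrete, here is a small Python sketch of the decision logic a Nagios-style check could apply to a metric series. It uses fixed warning and critical thresholds as a simpler stand-in for the prediction-based approach mentioned above, and it is not an existing Icinga or Graphite plugin. The exit codes follow the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name `evaluate` and the sample numbers are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(datapoints, warn, crit):
    """Map the most recent non-missing datapoint to a plugin status.

    `datapoints` is a list of numbers where missing samples are None.
    Returns (exit_code, message). Illustration only, with static thresholds.
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"CRITICAL: {latest} >= {crit}"
    if latest >= warn:
        return WARNING, f"WARNING: {latest} >= {warn}"
    return OK, f"OK: {latest}"
```

A wrapper script would print the message and exit with the returned code, for example `code, message = evaluate(series, warn=10, crit=20)`. Treating an empty series as UNKNOWN rather than OK avoids silently passing when the metric itself has stopped reporting.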

bug 22390 - When a commons image is updated, update the pages that use it
- Brian Wolff added one patch to deal with one aspect, still needing another half

Sorted (roughly) by priority

Delayed/de-prioritized relative to Monitoring

Stories:

As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged

Primary goals:

Maintain reasonable usability
Speed: no more than 10-15 minutes
Graceful handling of unresponsive Apaches
A workflow for security patches
Better alerting / monitoring
- Smoke test
Usability: commands used should map to logical activities rather than minutiae
Better SAL entries (include commit ranges)
- Maybe an easy way to get diffs of what was deployed
Deal with umask and .bashrc insanity
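
The "Better SAL entries (include commit ranges)" goal above could look roughly like the following Python sketch. The entry layout is an assumption, not the format any existing tool emits, and the helper names `sal_entry` and `diff_command` are hypothetical. The only external convention relied on is git's `old..new` range notation.

```python
def sal_entry(user, action, old_rev, new_rev, short=7):
    """Format a log line that names the deployed commit range.

    The layout is illustrative only; revisions are abbreviated to `short`
    characters for readability.
    """
    commit_range = f"{old_rev[:short]}..{new_rev[:short]}"
    return f"{user} {action} ({commit_range})"


def diff_command(old_rev, new_rev):
    """Return the git invocation that summarizes what the range changed."""
    return ["git", "diff", "--stat", f"{old_rev}..{new_rev}"]
```

With the range in the entry, anyone reading the log can pass the same two revisions to `git diff` or `git log` to see what was deployed, which also covers the "easy way to get diffs" idea.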