Deployment tooling/2013Sprint
The DevOps Sprint 2013 was a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.
Post sprint retrospective notes
Sprint focus areas
Monitoring
Primary Goals:
- We shouldn't find out that parts of our infrastructure are down only because a browser test failed (those tests run only twice a day).
- Inform deployment rollback decisions based on pre/post deployment performance metrics
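As a rough illustration of the second goal, a rollback check could compare a metric sampled before and after a deploy. The sketch below is hypothetical: the function name, the 20% tolerance, and the choice of the median are assumptions for illustration, not anything the sprint specified.

```python
from statistics import median


def should_consider_rollback(pre, post, tolerance=0.20):
    """Return True if the post-deploy median is worse than the pre-deploy
    median by more than `tolerance` (a fraction; 0.20 means 20%).

    Assumes higher values are worse (e.g. latency in ms or error counts).
    """
    if not pre or not post:
        raise ValueError("need samples from both windows")
    baseline = median(pre)
    current = median(post)
    if baseline == 0:
        # Any regression from a zero baseline counts as a change.
        return current > 0
    return (current - baseline) / baseline > tolerance
```

A result of True would only be a prompt for a human to look at the graphs, not an automatic rollback.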
in progress
Graphite
- Ori: finish puppetization
- need ops review?
- Migrate to eqiad (ops or ori?)
- Aaron: enable deploy markings by default
Logstash
- Done Ops/Bryan: procure servers - rt ticket
- Done Ops/Greg: get Aaron and Bryan root on those servers - rt ticket
- Write more filter rules to parse various message formats
- Get a syslog feed from beta
- Done Package logstash jar for deployment
- Ask Andrew Otto for advice based on Analytics packaging experience for Java projects
- Done Make a puppet module
- Note: matanya has said "i'm writing the logstash module" on IRC which may cover this and the packaging question
- We would probably still need to provide config for our usage
- Done Determine architecture for production deployment
- What log forwarding methods should be used?
- Udp2log is being used for now. Additional inputs will be added as needed
- How many Logstash instances?
- Starting with a 3 node cluster. Udp2log is only pointed at a single instance currently (logstash1001).
- HA strategy?
- Native Elasticsearch clustering. Kibana can be served from any of the 3 nodes. Udp2log input is currently a single point of failure.
- QOS terms?
- No QoS/ToS requirements have been established yet.
- Done tee the logs from MW to logstash (in addition to current fluorine logs)
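The "filter rules to parse various message formats" item can be pictured as turning a raw log line into named fields. Real Logstash filters are written in Logstash's own configuration language; the Python below only illustrates the idea. The line layout, the field names, and the `parse_failure` tag are all assumptions made for this sketch.

```python
import re

# Hypothetical line layout: "<timestamp> <host> <channel>: <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) (?P<channel>[\w-]+): (?P<message>.*)$"
)


def parse_line(line):
    """Split a raw log line into a dict of named fields.

    Lines that do not match are kept whole and tagged, so nothing is
    silently dropped.
    """
    match = LINE_RE.match(line)
    if match is None:
        return {"message": line, "tags": ["parse_failure"]}
    return match.groupdict()
```

Keeping unparsed lines, rather than discarding them, makes it easier to spot message formats that still need a rule.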
Logging
- Structured logging RFC
- Design proposed PHP API changes
- Clean proposal to remove/archive other examples
- Done Submit structured logging RFC
- Ori: Record sync/scap elapsed wall clock times in graphite
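Recording elapsed wall clock time in Graphite could look roughly like the sketch below. It uses Graphite's plaintext protocol (one `path value timestamp` line per sample, conventionally on TCP port 2003). The metric path `deploy.scap.elapsed`, the helper names, and the host are hypothetical; this is not how scap itself reports timings.

```python
import socket
import time
from contextlib import contextmanager


def format_graphite_line(path, value, timestamp):
    """Build one line of Graphite's plaintext protocol."""
    return "%s %s %d\n" % (path, value, int(timestamp))


def send_to_graphite(line, host, port=2003, timeout=2.0):
    """Send a single plaintext-protocol line to a carbon listener."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


@contextmanager
def timed(path, emit):
    """Time the enclosed block and pass the formatted sample to `emit`.

    `emit` is any callable taking the line; injecting it keeps the timer
    independent of the network.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        emit(format_graphite_line(path, "%.3f" % elapsed, time.time()))
```

In use, the deploy step would be wrapped in `with timed("deploy.scap.elapsed", emit):`, where `emit` forwards the line to `send_to_graphite` with the real carbon host filled in.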
todo
- On-wiki documentation of fatal and exception logging on the cluster - bug 52026
- Make l10nupdate emit useful log messages
- report to SAL with where the log lives
- Get Icinga to alert on important metrics
- Determine if icinga has the graphite plugin
- Have graphite use the prediction algo plugin for alertable metrics
- Figure out where platform alerts should go (mw-core initially?)
- Turn stuff on (for a set of metrics)
- Review of current metrics for alertable ones
- Find new relevant monitoring/alerting metrics
- Instrumentation in scap to relay more information out of the deployer's terminal
- dashboard grid of servers being updated, color indicating status of individual server's code version
- Log/show exceptions per file/extension
- simple text file initially
- Logstash in production
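For the Icinga alerting items above, a check against Graphite data might be structured like the sketch below. It uses the standard Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the `[value, timestamp]` datapoint shape of Graphite's JSON render output. A fixed threshold on the mean stands in here for the prediction-based approach mentioned in the list; the function name and thresholds are assumptions.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_datapoints(datapoints, warn, crit):
    """Classify a Graphite series against fixed thresholds.

    `datapoints` is a list of [value, timestamp] pairs; missing samples
    appear as None. Returns (exit_code, message).
    Assumes higher values are worse.
    """
    values = [value for value, _ts in datapoints if value is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints in window"
    mean = sum(values) / len(values)
    if mean >= crit:
        return CRITICAL, "CRITICAL: mean %.2f >= %s" % (mean, crit)
    if mean >= warn:
        return WARNING, "WARNING: mean %.2f >= %s" % (mean, warn)
    return OK, "OK: mean %.2f" % mean
```

A wrapper script would print the message and exit with the returned code, which is what Icinga reads to decide the alert state.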
done
- Done (Hashar) Create logstash project in labs for testing
- Done (Bryan) Install logstash in labs for testing
Cache Improvements
in progress
- bug 22390 - When a commons image is updated, update the pages that use it
- Brian Wolff added a patch that deals with one aspect; the other half is still needed
todo
sorted (roughly) by priority
- bug 17577 - Include version in thumbnail URL
- bug 48835 - Separate Cache-Control header for proxy and client
- Implement thumbnail purging RfC
- bug 17577 - Image urls should have far future expires
- bug 46770 - Rewrite jobs-loop.sh in a proper programming language
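Two of the items above (a version in the thumbnail URL, and separate cache lifetimes for proxies and clients) can be illustrated together. In HTTP, `max-age` applies to all caches while `s-maxage` overrides it for shared caches such as proxies. The query parameter name `v` and both helper names are hypothetical; the actual URL scheme for bug 17577 is not defined here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_control(client_seconds, proxy_seconds):
    """Build a Cache-Control value with separate client and proxy lifetimes."""
    return "public, max-age=%d, s-maxage=%d" % (client_seconds, proxy_seconds)


def versioned_url(url, version):
    """Return `url` with a version marker in the query string.

    A changed version yields a new URL, so the old cached copy is simply
    never requested again and can carry a far-future expiry.
    """
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "v"]
    query.append(("v", str(version)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With versioned URLs in place, a long lifetime becomes safe because an updated image gets a different URL rather than relying on a purge.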
done
- Done (Brad) bug 5382 - Queue refreshLinks jobs on template deletion
- Done (Tim) bug 27935 - Redirect to canonical encoding
Deployment
Delayed/de-prioritized relative to Monitoring
Stories:
- As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged
Primary goals:
- Maintain reasonable usability
- Speed: no more than 10-15 minutes
- Graceful handling of unresponsive Apaches
- A workflow for security patches
- Better alerting / monitoring
- Smoke test
- Usability: commands used should map to logical activities rather than minutiae
- Better SAL entries (include commit ranges)
- Maybe an easy way to get diffs of what was deployed
- Deal with umask and .bashrc insanity
ACTION ITEMS
- Audit of salt scripts for completeness - bug 43615
- Add rsync backend for Trebuchet - bug 54185
- Add submodule (and recursive submodule) support to Trebuchet - bug 51581
- (Ryan) Put new Trebuchet frontend on labs
- (Aaron) replace scap-recompile with a .deb package of texvc - bug 45076
- Enable Trebuchet logging to SAL/IRC
- Fix the db migration from small to medium - bug 56222
- Integrate work from Joey H into Trebuchet (git corruption fix, bug 51142)
- Test Trebuchet on production to dummy dir, point a testwiki to it
- Look at l10nupdate; DevOps Sprint 2013/l10nupdate dataflow