Wikimedia Release Engineering Team/Deployment pipeline/2018-12-20
Last Time
General
- "I survived another meeting that could have been an email"
- Strive for this not to be true
- Sometimes it is
- Let's be bold about skipping (but let's have an email version instead)
- topic: discuss Beta aka deployment-prep and k8s
- (couldn't find task that tracks this)
- but we have a patch instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/478637
- Marko: is beta important? If so something should be done. Have run into this since the last meeting
- Joe: I would like to move to a proper staging environment; do things have to run in beta? Probably not, but sometimes they are needed
- Marko: A higher percentage of the puppet code we use in production will become obsolete or maybe won't be in puppet
- Joe: whatever is needed to test a mediawiki extension is probably needed there (for services)
- Joe: hiera to run this image, use this config, etc. Want to avoid setting up a k8s cluster to run in beta that is different than production.
- Marko: try this next quarter for eventgate
- Joe: I want to try with mathoid Soon™
- Track and install additional npm packages for all service container images
- SRE nodeX base image in the operations base image repo
- Joe: gc-stats?
- Marko: used for sending stats
- Dan: There's another way to do this with Blubber that doesn't involve relying on a "custom" docker-pkg base image
- Plus-side: more ability to make changes by services
- Downside: lots of blubber file duplication
- Allow access to blubberoid.discovery.wmnet:8748
- Summary so far:
- Use Cases: local development, CI, Pipeline building prod images
- Dan: single deployment for developers and CI and prod unifies environments (due to things like policy files [not currently in use, but is useful])
- Alex: WMCS can't talk to wmnet, so opening to WMCS == opening to everyone
- Alex: Blubber as a Service (BaaS) works counter to unified tooling because it neglects offline/low-bandwidth use-case
- Dan: I don't see how the service model works counter to unifying but perhaps it works counter to an offline dev-env requirement that we haven't named. That's fine but we shouldn't conflate the requirements
- Joe: people download and install so much untrusted binary garbage from github, we can distribute binaries for linux/windows/osx quite easily I think?
- Thcipriani: FWIW, we do have garbage binaries built via the `make release` target in the repo, currently posted on my people page, unfortunately: https://people.wikimedia.org/~thcipriani/blubber/
- Lars: it would be good to avoid perpetuating bad security practices? Sure, that wasn't my point :)
- Joe: I wouldn't point developers to BaaS, but it could be exposed publicly -- low potential for abuse
- Dan: I don't see much potential for abuse either
- liw: provides means to overload CPU, but maybe k8s policies can prevent this
- alex: we have policies already (1800 millicores is blubber's limit -- max found via testing with Jeena)
- alex: I worry that BaaS becomes critical to the tooling due to networking problems for developers -- non-up-to-date policy files, non up-to-date blubber
- fselles: could commit output from blubberoid into some repo
- joe: could generate lots of variants from one blubber file; I think we could tell folks to download the binary from gerrit
- Joe: I worry about a tool that creates images for the k8s cluster being dependent on the k8s cluster -- maybe we should use blubberoid in its own container -- I need to think this through
- fselles: we have 2 clusters; also, we should trust blubberoid
- Joe: redundancy probably means this is OK
- compromise: use blubber for local development, and blubberoid for CI (see the sketch after this list)
- TASK: releases should be updated automagically
- EPIC TASK: for developer tooling to keep track of this discussion
- Alex: we know the components of the developer tooling, but we don't know how those will fit together yet
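To make the blubber-locally / blubberoid-in-CI compromise concrete, here is a minimal Python sketch of how CI might ask blubberoid to render Dockerfile variants from a single Blubber config, e.g. so the output could be committed to a repo as fselles suggested. The host and port come from the notes above; the `/v1/<variant>` route, the YAML content type, the `.pipeline/blubber.yaml` path, and the variant names are illustrative assumptions, not a confirmed API.

```python
import requests

# Host/port are from the notes above; the route and content type are
# illustrative assumptions about blubberoid's HTTP API.
BLUBBEROID = "http://blubberoid.discovery.wmnet:8748"

def render_dockerfile(blubber_config: str, variant: str) -> str:
    """POST the Blubber config and get back a Dockerfile for one variant."""
    resp = requests.post(
        f"{BLUBBEROID}/v1/{variant}",
        data=blubber_config,
        headers={"Content-Type": "application/yaml"},
    )
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    # Render a couple of hypothetical variants from one config.
    with open(".pipeline/blubber.yaml") as f:
        config = f.read()
    for variant in ("test", "production"):
        print(render_dockerfile(config, variant))
```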
RelEng
- Initial production image build fails helm test
- just check for .pipeline/helm.yaml (see the helm.yaml sketch after this list)
- thcipriani: ooooh...
- Cleaning old image tags (they confuse version sort): https://people.wikimedia.org/~thcipriani/docker/wikimedia/mediawiki-services-mathoid/tags/ (see the tag-sorting sketch after this list)
- Currently no way to delete images on the registry
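On the helm test point, a minimal sketch (the function name and repo layout check are illustrative assumptions) of gating `helm test` on whether the repo actually ships a chart config, so an initial production image build without one is skipped rather than failed:

```python
from pathlib import Path

def should_run_helm_test(repo_root: str) -> bool:
    # Hypothetical guard: only run `helm test` when the repo ships
    # .pipeline/helm.yaml, so an initial image build without chart
    # config does not fail the pipeline.
    return (Path(repo_root) / ".pipeline" / "helm.yaml").is_file()
```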
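On the tag-sorting point, a sketch of how the tags that confuse version sort could be reported. The date-based tag format is an assumption about the pipeline's tagging convention, and since there is currently no way to delete images on the registry this only lists candidates rather than removing anything:

```python
import re

# Assumed pipeline tag convention: YYYY-MM-DD-HHMMSS[-suffix].
TAG_RE = re.compile(r"^\d{4}-\d{2}-\d{2}-\d{6}")

def split_tags(tags):
    # Tags matching the assumed convention sort cleanly by date; anything
    # else is what confuses version sort and is reported as a cleanup
    # candidate (the registry offers no delete today, so report only).
    conforming = sorted(t for t in tags if TAG_RE.match(t))
    candidates = sorted(t for t in tags if not TAG_RE.match(t))
    return conforming, candidates
```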
Serviceops
- TEC3 goal posted by mark
- Lots of services for next quarter
- ORES is going to consume some time
- changeprop/cpjobqueue at least a month apart?
- Marko: need some clarification; I don't think that's doable. We need the same version of the kafka driver, and since these share a repo it's not clear how to run one on node6 and the other on node10 while keeping the same driver version
- Joe: cpjobqueue is scary to move (we can only handle a few minutes of outage for that service). If we need to stagger these repos we could maybe use the same deploy repo
- Joe: we could maybe use git branches, or something, for a short period: we shouldn't migrate both at the same time
- Marko: what heuristics do we use for resource allocation for these services?
- Alex: both of these are hard to benchmark
- fselles: try to assign similar resources and adjust using monitoring
- Joe: I think we're not limited to 1 process per pod
- Alex: we do not want to use ncpu
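On the ncpu point: inside a pod, the usual one-worker-per-CPU heuristic sees the node's CPU count, not the pod's CPU limit, so sizing by ncpu over-subscribes the pod. A minimal Python sketch of the alternative, reading an explicit worker count from the deployment's environment; the NUM_WORKERS name is an illustrative assumption, and since the services in question are Node.js this is an analogy rather than their actual code:

```python
import os

# os.cpu_count() inside a container reports the node's CPUs, not the pod's
# CPU limit, which is why sizing worker pools by ncpu is avoided here.
# The NUM_WORKERS variable name is an assumption for illustration.
workers = int(os.environ.get("NUM_WORKERS", "1"))
print(f"starting {workers} workers (node reports {os.cpu_count()} CPUs)")
```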