Wikimedia Cloud Services team/goals/2023-24
Q3/Q4 (Jan 2024 to June 2024)
Goals
Incomplete goals from Q1+Q2
- Toolforge
- phab:T332955 [maintain-dbusers] Generate prometheus metrics
- Cloud VPS
- phab:T328502 Move WMCS off of Icinga and introduce alertmanager
- phab:T341060 openstack eqiad1: introduce cloud-private and cloudlb
- phab:T309789 [ceph] Upgrade hosts to bullseye
- phab:T306820 [ceph] Upgrade to v16
- phab:T297083 [ceph] Getting rack level HA
- Data Services
- phab:T291782 Migrate largest ToolsDB users to Trove
New goals for Q3/Q4
- Toolforge
- phab:T314664 Toolforge: Decommission the Grid Engine infrastructure
- phab:T194332 Some form of push-to-deploy on Toolforge
- phab:T311897 Migrate Toolforge off of Debian Buster
- phab:T336668 [harbor] Create backups and/or replication
- phab:T350687 [harbor] Move harbor data to object storage service
- phab:T356301 [harbor] Deploy with Helm
- Cloud VPS
- phab:T351450 Migrate Cloud VPS puppet infrastructure to Puppet 7
- phab:T356287 Upgrade to openstack version Bobcat
- phab:T356291 Reliable Trove backups
- phab:T353356 Magnum PoC with RelEng
- phab:T347490 [wmcs-cookbooks] Downtime alerts from cloudcumins
- Data Services
- phab:T352206 [toolsdb] Upgrade to MariaDB 10.6
- phab:T344717 [toolsdb] test creating a new replica host
- phab:T344719 [toolsdb] test failover procedure
Q1/Q2 (Sep 2023 to Dec 2023)
Availability
- Komla out all quarter
- Arturo out all of August
- David out for half of July, one week of August
Goals
- Toolforge
- phab:search/query/WVN9f5IZgGkm/ Continue work on Envvars service (David) Done
- phab:T335249 Toolforge build service beta round 2 (David) Done
- phab:T334081 NFS-free webservice (David/Taavi) Done
- phab:T267374 Migration of projects from grid (Mostly communication, not active hand-holding this quarter) Done
- phab:T341084 Continuous delivery (gitlab pipeline things) (Arturo/David) Done
- Organize toolforge-wide workgroup to replace subtask-specific workgroups. (David) Done
- Reduce and define the UI surface of the toolforge platform via CLI and API definitions. (David + Taavi + Arturo)
- phab:T298005 Kubernetes version updates (Taavi + Arturo) Done
- Superset
- phab:tag/superset.wmcloud.org/ Continue operational exploration and responding to feedback (Rook) Done
- Cloud-vps + infra
- phab:T341285 Openstack Upgrade to A (Francesco + Andrew) Done
- phab:tag/openstack-magnum/ Magnum
- phab:T328711 Experiment with magnum-ui, decide whether it's useful in its current state (Rook + Andrew) Done
- Automated testing (Rook and/or Andrew)
- phab:T328712 Prepare service for non-wmcs users (Rook + Andrew)
- Trove
- Automated testing (Rook and/or Andrew)
- phab:T337396 Investigate possible postgres improvements (Francesco)
- Continue to shepherd upstream fixes (Andrew)
- NFS
- phab:T292546 Continue to improve/stabilize/expand cinder backups (Andrew) Done
- phab:T333477 Review/improve observability (Andrew + Taavi) Done
- Network redesign
- phab:T338937 Migrate swift to new network setup in codfw1dev (Arturo and Andrew) Done
- phab:T341060 Design Eqiad implementation (Arturo)
- Ceph
- phab:T309789 Host OS upgrades (Francesco + Raymond)
- phab:T306820 Ceph Version upgrade (Post Host OS Upgrade) (David + Raymond)
- phab:T297083 Multirack HA (David)
- Team infra
- Cumin
- phab:T319401,phab:T325067 Finish work on dedicated cumin exec hosts (Francesco) Done
- Observability
- Continue to improve VM + Prometheus + Alertmanager integration (Taavi)
- phab:T328502 Move more alerts from Icinga to Alertmanager (Taavi)
- Enable silences from alerts.w.o for metricsinfra alerts (depends on o11y team upgrades to alertmanager)
- Cumin
- Hackathon Support (Slavina) Done
- phab:T327319 Documentation (Tricia, support from Francesco) Done