Agenda

Introductions (5m)
Expected Future Kubernetes Usage (5m per team)
- Overall Vision
- Next few big projects related or bound to benefit from increased Kubernetes support
Current Challenges (5m per team)
Next steps (10-15m)
- Structure of the WG?
  - Possible questions to answer (courtesy of Ben Tullis)
    - What might we expect of such a group? e.g. Just help to review CRs? Help to upgrade K8S? Help to migrate services? Policy definition and review?
    - How might it work? Regular meetings? An IRC channel? Show & tell?
    - How much time might participating in such a group realistically take, when we are all working to capacity anyway?

Sample Answers (courtesy of Jesse Hathaway)

> * Who would choose to be a part of it in the first place? Who else apart
> from SREs should be involved?

Ideally I think it should be as open as possible, so anyone who is
interested, including community members, can join.

> * What might we expect of such a group? e.g. Just help to review CRs?
> Help to upgrade K8S? Help to migrate services? Policy definition and review?

I would hope the group would do all of the things you mention, even just
having a list of people to add to CRs would be a great start.

> * How might it work? Regular meetings? an IRC channel? Show & tell?

I would love to have:

* An IRC Channel
* A Mailing list for discussions
* Documentation on how to interact with the group

> * How much time might participating in such a group realistically take,
> when we are all working to capacity anyway?

That is a really great question. Kubernetes is a huge, fast-moving beast;
I assume it will take quite a bit of our time!

Prior Work

The history of Kubernetes in WMF - SRE Summit 2022 - Recording - slides
The future of Kubernetes in production - SRE Summit 2022 - slides - summary doc (TBD)

Meeting notes

What do people want to get from k8s
- Alexandros - Want to add a lot more functionality to the cluster to make it easier for people to work. Want to move some wikis to Wikikube, which will bring a lot of traffic there.
- Chris: Horizontal scaling, Kubeflow (compared to ORES)
  - “If you go at it alone (like ORES), when things break, it gets difficult. Should be part of larger infrastructure / integrated with SRE.”
- Olja - DE manages 43 different services. Separate compute and storage and better utilize those resources.
- Emil - Data infrastructure has been siloed and difficult to use. Need self-service.
- Unifying a lot of the data infra and use cases would be good. Allowing us to be more flexible with experiments also good.
- Ben - The consumers are the general public and the deployers are various engineering teams. When we look at use cases, it's generally internal teams who are the end users, and we are trying to replicate on k8s what they can currently do with YARN. It's a blend of data analytics jobs and user-submitted jobs (product analytics data scientists submitting jobs). How best can we do configuration management and change management, and be more efficient?
WMCS was invited but couldn’t attend this meeting
Observability is also very interested.
WMCS have several use cases. They are interested in providing Kubernetes as a service, but they also have Kubernetes deeply embedded within (Toolforge?) as a replacement for GridEngine. So we don’t share much of a codebase these days.
Alexandros: Kubernetes is hard. Janis and Luca have been instrumental in getting to where we are, but we do not have a lot of people to work on it. Much of the work to upgrade to v1.23
Ben: Could we clarify the multiple-version support? Janis: We support "now" and "next" right now, might do more in future, but versioned apt components already exist for multiple versions.

Chris: didn’t know we could have different versions. Would recommend to swarm all SREs on one upgrade a year.
Olja: There is an argument that we should do the hard things often rather than letting them linger and build up.
Janis: K8s upstream uses a tick-tock model so slightly faster than once a year for major updates
Luca: Earlier clusters like Lift Wing were really slow to stand up because a lot had to be learned, but newer clusters like Aux were stood up much faster. The more we do upgrades, the faster and less painful it will be.
Ben: Is there a way we can streamline the upgrade process to make it easier?
Janis: We currently use the reimaging route because it seemed the safest.
Tobias: Hypothetical: One team needs a new version of Istio and one team can’t use the new version of Istio. What do we do?
Olja: In my previous teams you could spin up new hardware in the cloud. How does that work here?
Alexandros: We have defined how to do it for a long time now https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Infrastructure_upgrade_policy
- We have timed it and the process takes two hours of wall-clock time
Olja: What are the next steps with the 1.23 upgrade?
Emil: Is there a timeline or next steps?
Luca: Ideally every team dedicates some time next quarter for upgrade work
- The more people work the better
- Might be useful to break this group into sub groups as needed
Janis: Second what Luca has said. Probably not good to give a timeline yet, but progress is being made.
Olja: Is the work fully scoped out? (people reply: yes)
Ben: Best way I can help is?
Alexandros: NEXT ITEM: HOW DO WE WANT TO STRUCTURE ALL OF THIS?
AndrewO: Deploying to k8s is actually harder right now. Let's be conscious of making it easier.

Outcomes:

Group agreed to meet as a large group once a month.
Group agreed to create smaller sub groups as needed that might have a different meeting cadence.
Group agreed to swarm more on the k8s upgrade.
Group agreed to make an IRC channel for daily work and discussion.