Kubernetes SIG/Meetings/2023-03-28
Agenda:
- Introductions for new members (if any)
- SIG administrivia:
- Nothing to report
- Misc:
- K8s SRE trainings
- Topic:
- Kubernetes versioning discussion (add your questions/comments)
- See current Kubernetes_Infrastructure_upgrade_policy
- Do we allow clusters to diverge? Or do we set a policy that everyone should be using the same versions of mostly the same components?
- If the latter, how/when do we allow room for divergence?
- How do we ensure that a team prioritizes upgrading their cluster?
- What happens when they can’t?
- How many parallel versions would we need to support?
- How often do we want to upgrade?
- Do we strive to always be in a supported release?
- How many releases do we want to lag behind the latest and greatest? (upstream is releasing 1.27 and we are already 4 versions behind :D).
- How do we deal with security upgrades (while we are on a supported release)?
- Do we need to support a different method of updating than re-init (for clusters that are not multi-DC and can’t be easily depooled)?
- Should we try to update components “out-of-band” (like updating calico while not updating k8s, to unclutter the k8s update)?
- Off the predefined topic:
- How far have we got with spark and user-generated workload on dse?
- How can we manage secrets better for users sharing namespaces? Or should we not?
- Draft document detailing progress with Spark/Hadoop/Hive already open for comments.
- Minutes:
- [SD] k3s for CI needs in WMCS. In favor of standardizing, but always keeping CI in mind and supporting it.
- [BT] There are many realms: production, WMCS, and quite possibly more (see Fabian’s question about creating a cluster without the usual policies). While WMCS sets its own rules, production should converge on the same policies for various reasons (e.g. security). Fabian’s question:
- > Has there been discussions/plans to create a development environment, e.g. a kubernetes namespace with less stringent process, less resources, internal network etc, that would facilitate faster iteration / development for the users of the infra?
- [BT] I would say (and will reply to Fabian) that creating a new realm for this would be an extremely large project.
- [SD] Focusing on security is fine, but it adds to the workload of people who just want to test things.
- [LT] If we don’t want a whole set of versions to maintain (always talking about the production realm), I wouldn’t go beyond supporting 2 versions. There are many code paths to maintain.
- [JM] It’s quite a pain to have 2 versions already, and that’s just for the migration; supporting them for longer would add to the cost. Advocating for keeping all clusters on the same version.
- [JW] Would agree with the suggestion of trying to keep all of the production clusters at a similar version. We can upgrade less production-critical clusters first, followed by the others, and also help any teams who find themselves stuck on a particular version.
- [BT] Also supports the suggestion of keeping the production clusters at a similar version.
- [JM] The tricky part will be getting the teams to sign off on the particular upgrade cadence that we want, and distributing the work to the various engineers who will be preparing and carrying out the upgrades.
- [LT] Next question, how often should we upgrade and how many versions behind the latest should we be?
- [JM] Please refer to the upgrade policy. There is a tick-tock release model (feature release -> stability release), and we should always be trying to support the stability release. So it would be a nice goal to be able to target the stability release each time, and not allow ourselves to get too far behind the latest supported release.
- [AK] Agreed, it would be good to always stick with the ‘tock’ or stability releases. 1.27 is coming out soon and 1.23 won’t be supported for long. Resourcing is the question: how do we actually manage to do this, and how do we bring the ‘stragglers’ with us? No easy answer at the moment, although clearly some of it is down to management to allocate resources and build upgrades into annual planning.
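- A minimal sketch of that support-window arithmetic in Python, assuming upstream supports roughly the three most recent minor releases (the version numbers are the ones from the discussion):

    # Sketch: how far behind upstream is a cluster, and is it still supported?
    def versions_behind(cluster_minor: int, latest_minor: int) -> int:
        return latest_minor - cluster_minor

    def in_support_window(cluster_minor: int, latest_minor: int, window: int = 3) -> bool:
        # Upstream supports roughly the `window` newest minor releases.
        return versions_behind(cluster_minor, latest_minor) < window

    print(versions_behind(23, 27))    # -> 4 (upstream at 1.27, clusters on 1.23)
    print(in_support_window(23, 27))  # -> False: 1.23 is dropping out of support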
- [LT] Agreed, we should plan to upgrade each year, if possible. It should be made integral to annual planning: all managers whose teams manage k8s clusters should ensure that they allocate resources. How we lead and distribute the work will be the big problem. Will it always be right to have the ServiceOps team lead it? Having ML lead it would be a lot of extra work for that team.
- [JM] It would be good to share the load of leading, but it was already a 2-team effort for this upgrade. It would be good to get more teams involved to share knowledge.
- [AK] +1 on rotating the order of leading on the upgrade between teams and getting more people/teams involved.
- Action Item for Everyone: speak to our respective managers and get sign-off for pre-planning this upgrade.
- [AK] Security release question - how do we apply these security upgrades? In the past we have just applied these in-cluster: rebuild packages, apply. It worked and by all accounts it still works. However, we haven’t done this recently. At some point it became more difficult and we haven’t done it much since then.
- [LT] Was it a stop-the-world upgrade, or rolling?
- [AK] It was generally rolling, nothing broke for a point release.
- [JM] This is what cloud providers do: upgrade the control plane, then upgrade the workers. “I’d say, let’s do it!” Minor point releases should always be compatible. We can just use debdeploy, right?
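- A minimal sketch of such a rolling in-cluster upgrade, one worker at a time; node names and the package step are illustrative (in practice debdeploy would roll out the rebuilt packages), and control-plane nodes would go first following the same pattern:

    # Sketch: rolling patch-release upgrade of worker nodes, one at a time.
    import subprocess

    def run(*cmd: str) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def upgrade_node(node: str) -> None:
        run("kubectl", "cordon", node)  # stop new pods landing on the node
        run("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")
        # Hypothetical package step; debdeploy (or plain apt) would do this in practice.
        run("ssh", node, "sudo apt-get install -y --only-upgrade kubelet")
        run("kubectl", "uncordon", node)

    for node in ["worker1001", "worker1002"]:  # illustrative node names
        upgrade_node(node)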
- [LT] Yes, but we should not have to do the upgrade aggressively. We should also share the load of reading changelogs etc.
- [LT] Suggested Action point for everybody: read the changelogs and security bulletins.
- Should all wmf-k8s-sig members have a shared reading list (things that we should all know about, like mailing lists etc.)?
- [LT] Will managing point releases push us to supporting 3 or more releases?
- [JW] A version difference of 1 minor version is possible for most components (see https://kubernetes.io/releases/version-skew-policy/), so a patch version difference should not be a problem.
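- A minimal sketch of checking that skew with the kubernetes Python client, flagging kubelets more than one minor version behind the API server (the one-minor bound mentioned above; a working kubeconfig is assumed):

    # Sketch: compare each kubelet's minor version against the API server's.
    import re
    from kubernetes import client, config

    def minor(version: str) -> int:
        # Versions look like "v1.23.6"; pull out the minor number.
        return int(re.search(r"v?1\.(\d+)", version).group(1))

    config.load_kube_config()
    server_minor = minor(client.VersionApi().get_code().git_version)

    for node in client.CoreV1Api().list_node().items:
        skew = server_minor - minor(node.status.node_info.kubelet_version)
        if skew > 1:
            print(f"{node.metadata.name} is {skew} minor versions behind the API server")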
- [JM] I wouldn’t treat point releases differently in terms of Puppet. Keeping multiple patch releases on apt.wm.org isn’t possible, so that makes patch rollbacks difficult (without a local apt cache)
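- A minimal sketch of spotting when a rollback would need a locally cached package, comparing each installed version against the candidate the archive currently carries (package names are illustrative; shell access to the node is assumed):

    # Sketch: once the archive only carries a newer patch release, rolling back
    # to the installed version needs a locally cached .deb.
    import re
    import subprocess

    def apt_policy(pkg: str) -> tuple[str, str]:
        out = subprocess.run(["apt-cache", "policy", pkg],
                             capture_output=True, text=True, check=True).stdout
        installed = re.search(r"Installed: (\S+)", out).group(1)
        candidate = re.search(r"Candidate: (\S+)", out).group(1)
        return installed, candidate

    for pkg in ["kubelet", "kubeadm", "kubectl"]:  # illustrative package names
        installed, candidate = apt_policy(pkg)
        if installed != candidate:
            print(f"{pkg}: installed {installed}, archive has {candidate}; "
                  f"a rollback would need a cached package")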