Talk:Wikimedia Performance Team

Previous discussion was archived at Talk:Wikimedia Performance Team/Archive 1 on 2016-03-10.

Bluerasberry (talkcontribs)

I made a layman fan page at meta:Grafana.wikimedia.org. Engage if you like. I wrote it out to clear my own thoughts and serve as a base for future conversation.

I am giving a conference presentation that includes some slides of Grafana charts, and this documentation is a by-product of my own attempt to orient myself to what I was seeing.

Krinkle (talkcontribs)

Thanks!

ATDT (talkcontribs)

I am proposing adjusting the name and scope of the team. What do you think?

Proposal

  • Extend the charter of the Performance team to Performance and Availability, with a focus on the scalability, availability and performance of Wikimedia sites.

What this expanded team will do

  • Identify and correct significant risks to the availability and performance of the Wikimedia sites, either by itself or in co-operation with other teams.

What this team will not do

  • Initiate product-related improvements to the MediaWiki platform
  • Support third-party use of MediaWiki (with technical support, improved packaging, etc.)

Core metrics

  • Site performance, particularly time to first paint and time to save an edit
  • Uptime

Rationale

The Performance team is already a de facto Performance and Availability team. Recent projects have included:

  • fixing the backlogged job queue;
  • diagnosing HHVM-related memory leaks;
  • contributing etcd support to PyBal, the load balancer;
  • automating legacy code deprecation in MediaWiki;
  • writing Puppet modules for Apache, Redis, and HHVM; and
  • instrumenting MediaWiki internals and creating dashboards for key metrics.

Ongoing projects include:

  • revamping and modernizing the image rendering stack; and
  • readying MediaWiki’s codebase for multi-datacenter deployment.

The existing Performance team already contributes non-performance-related features that enhance the availability and scalability of Wikimedia sites. Implementing this proposal would formalize a slightly expanded scope and raise the visibility of the team inside and outside of the Wikimedia Foundation. It would demonstrate the Foundation's dedication to, and focus on, a core task: keeping Wikipedia and her sister sites up and running for the community.

Financial impact

Implementing this proposal would formalize the scope of an existing team without any headcount additions.

GDubuc (WMF) (talkcontribs)

Looks good to me :)

Jdforrester (WMF) (talkcontribs)

+1, sounds great.

Aaron Schulz (talkcontribs)

Seems OK, though "uptime" could use qualification (e.g. core MW services?).

Faidon Liambotis (WMF) (talkcontribs)

First of all: I'm sympathetic to a formal expansion of the scope of the performance team. It has been one of the most reliable teams during turbulent times for the organization as a whole, and a tremendous help to the ops team. The team has gone above and beyond its responsibilities in its work, so it feels right to expand the charter of the team to cover the work it has been doing.

That said, I am a bit skeptical of the "availability" term and have been so independently of this proposal. The reasoning is two-fold: first, it feels like a lot of the work fits availability as awkwardly as it now fits performance, e.g. the Thumbor work (work that needs to happen for a variety of reasons: availability and performance among them, but also security and product-focused features) or the MediaWiki URL routing work. Calling it "availability" feels like just listing another aspect of some of the great work that happens within that team, not unlike its performance aspect right now.

Second, this team's work on availability isn't cross-cutting right now, and I wouldn't expect it to be, given its current resourcing in headcount and skillset; the team has neither worked on the availability of other independent but critical infrastructure components (e.g. RESTBase) nor dealt with availability plans at large (e.g. disaster recovery scenarios in case of partial or total failures of datacenters). Aaron's comment about the "uptime" KPI reflects that as well.

My counter-proposal has been something along the lines of "core platform", but I realize that the term is both tainted by our past and quite vague, so it may not be the best of ideas :)
