

From mediawiki.org

Decision Record: Where Do We Build the Instrument Configurator?

  • Status: open for feedback
  • Recommender: Sam Smith
  • Decider: Will Doran
  • To be consulted:
    • Agreers:
      • Service Operations
      • DPE SRE
      • Security
    • Inputers:
      • MediaWiki Engineering
  • To be informed: SDS 2.5 Steering Committee
  • Date authored: 2024-01-29
  • Target decision date: 2024-03-15

Technical Story: [SPIKE] Draft of Mediawiki extension proposal for Metrics Platform Instrumentation (& Experimentation)

Keywords


instrument, instrumentation, config, configuration, mediawiki, extension, service, services

Context and Problem Statement


From SDS2.5.2: Instrumentation Configuration

If we create an instrumentation configuration system that has a low technical barrier to entry, we can

  • reduce the amount of engineering time required to create and manage instruments
  • decrease the time to data in order to enable confident data-based decision making across product decision makers.

In order to accomplish SDS2.5.2, we must first decide where to build the configurator. If we do not decide, then we cannot deliver on SDS2.5.2.

Decision Drivers

  1. Data Products’ commitment to delivering a Minimum Lovable Product for SDS2.5.2 by EOQ3
  2. Performance. The instrument configurator must be able to deliver instrumentation configuration to all Wikipedia users
  3. Security
  4. The levels of support from Site Reliability Engineering (SRE), Service Operations (ServiceOps), Data Platform Engineering SRE (DPE SRE), and Release Engineering (RelEng)
  5. The long-term goal of having a unified tool for configuring instruments and experiments across all Wikimedia-hosted sites
  6. The competencies of Metrics Platform/Data Products engineers and the amount of engineering effort required
  7. Long-term maintainability.

Considered Options

  1. Standalone application (app) in the WikiKube cluster with a bridging MediaWiki extension
  2. Standalone app in the dse-k8s cluster with a bridging MediaWiki extension
  3. MediaWiki extension

Recommendation


Option 2: Standalone application in the dse-k8s cluster with a bridging extension.

The guidance we received from the Principal Architect for MediaWiki Platform aligns with this recommendation.

Positive Consequences


Risks and Mitigations

  • Performance and security reviews have difficult-to-predict outcomes. We will consult with the Security team early, providing them with the Decision Record and Design Document once an option is selected. We will also continue to engage proactively with the MediaWiki Platform Team to ensure alignment
  • We may have to request a concept review from SRE to help us work through the failure scenarios of a distributed system. To mitigate this, we will consult with ServiceOps early. In the immediate term, we will use the dse-k8s cluster, which will allow us to prototype more efficiently
  • We will be using the dse-k8s cluster for a non-mainstream use case, though one that still falls within its remit. WikiKube is geared specifically toward MediaWiki-related services, whereas dse-k8s is geared toward data engineering work, and this service could be argued to belong in either. We accept the potential work involved in porting from dse-k8s to WikiKube if that is required in the long term
  • In using dse-k8s, we accept that support is provided only during working hours and that availability is not guaranteed. In the event that dse-k8s or the app is unavailable, the bridging extension will disable all instruments
  • In using the Data Platform PostgreSQL cluster, we accept the cluster is not multi-DC. In order to mitigate the risk of performance degradation during the eqiad-codfw DC switchover, the app will maintain its own cache
  • All considered options require some MediaWiki development. Two members of the team have extensive experience in this domain and will share knowledge throughout development with team members who have less experience
  • We are aware of an early-stage effort to abstract away the source of configuration from MediaWiki so that the current static configuration blob can be moved into an external data store. We accept that there will be work involved in porting the system to that paradigm
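
The fail-closed behaviour described above (disable all instruments when dse-k8s or the app is unavailable) can be sketched as follows. This is only an illustration in Python, not the bridging extension's code, which would be written in PHP. The function name, the shape of the `fetch` callable, and the instrument dictionaries are all assumptions.

```python
from typing import Callable, Dict, List


def active_instruments_or_disabled(fetch: Callable[[], List[Dict]]) -> List[Dict]:
    """Return the active instruments, or an empty list if the app is unreachable.

    `fetch` is a hypothetical callable that asks the configurator app for the
    active instruments and raises on network or decoding errors.
    """
    try:
        return fetch()
    except (OSError, ValueError):
        # ConnectionError and TimeoutError are subclasses of OSError;
        # ValueError covers a malformed response body.
        # Failing closed: with no configuration, no instrument is enabled.
        return []


def app_is_down() -> List[Dict]:
    """Stand-in for a request made while dse-k8s or the app is unavailable."""
    raise ConnectionError("configurator app unavailable")


print(active_instruments_or_disabled(app_is_down))  # prints []
```

Returning an empty list rather than a stale copy is the conservative choice here: a page is never instrumented on the basis of configuration that the app can no longer vouch for.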

Standalone Application in the WikiKube Cluster

Dimension | Answer | Notes
Collaborating teams? | SRE, ServiceOps, Data Persistence, RelEng | ServiceOps have already signaled that they cannot provide support until April
Is the deployment path clearly defined? | Yes |
Estimated time to build and deploy? | 10-12 weeks |
Availability after deployment? | High |
Extension required? | Yes | T355599: Where Do We Build the Instrument Configurator
Does this affect build or commission? | No | Must (1) provide a module for OAuth or OpenID for auth (see T355599: Where Do We Build the Instrument Configurator) and (2) provide an HTTP API to fetch details of active instruments
Programming languages available? | JS (frontend), PHP, Node.js (backend), Go, Python |

Standalone Application in the dse-k8s Cluster

Dimension | Answer | Notes
Collaborating teams? | DPE SRE, RelEng |
Is the deployment path clearly defined? | Yes |
Estimated time to build and deploy? | 10-12 weeks |
Availability after deployment? | Variable | The dse-k8s cluster has no SLO for availability. dse-k8s is only deployed in the eqiad DC. If there were a catastrophic event in eqiad, the service and its functionality would not be available until eqiad was available again.
Extension required? | Yes | T355599: Where Do We Build the Instrument Configurator
Does this affect build or commission? | No | Must (1) provide a module for OAuth or OpenID for auth (see T355599: Where Do We Build the Instrument Configurator) and (2) provide an HTTP API to fetch details of active instruments
Programming languages available? | JS (frontend), PHP, Node.js (backend), Go, Python, Java |

MediaWiki Extension

Dimension | Answer | Notes
Collaborating teams? | Data Persistence, RelEng |
Is the deployment path clearly defined? | Yes |
Estimated time to build and deploy? | 8-9 weeks |
Availability after deployment? | High | Two team members are already deployers and can onboard other team members.
Extension required? | |
Does this affect build or commission? | Yes | We cannot commission a third-party piece of software to act as the instrument configurator.
Programming languages available? | PHP, JS |

Storage

Cluster | Owner | Notes
Main MariaDB cluster | Data Persistence | Multi-DC; writes go to the primary DC, reads possibly to both (the app must be capable of that)
Analytics Meta MariaDB cluster | DPE SRE | In the eqiad DC. When codfw becomes the primary DC, our app would still be talking to the DB in eqiad, decreasing perceived performance for the app user
Data Platform PostgreSQL cluster | DPE SRE | Same as for the Analytics Meta MariaDB cluster above
Cassandra RESTBase cluster | Data Persistence | Multi-DC, reads and writes to both DCs, eventually consistent

We predict a row size of 3.34 KiB and estimate that there will be on the order of 100s of rows.
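
To put those figures in perspective, the arithmetic can be worked through in a few lines of Python. The row counts below are illustrative picks within "100s of rows", and the estimate ignores indexes and any per-row storage overhead.

```python
ROW_SIZE_KIB = 3.34  # predicted row size, taken from the estimate above


def estimated_storage_kib(rows: int) -> float:
    """Raw table size in KiB for a given number of rows (no index overhead)."""
    return rows * ROW_SIZE_KIB


# Illustrative row counts only; the record says "on the order of 100s of rows".
for rows in (100, 500, 900):
    kib = estimated_storage_kib(rows)
    print(f"{rows} rows: about {kib:.0f} KiB ({kib / 1024:.2f} MiB)")
```

Even at 900 rows the raw data comes to roughly 3 MiB, so storage volume is unlikely to be what distinguishes the clusters listed above; availability and multi-DC behaviour matter more.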

Notes


Concept, security, and performance reviews are required for all options.

Bridging Extension


If we opt to deploy a standalone application, we must also build and deploy a bridging extension that adapts the output of the app to MediaWiki and gives access to its internals, configuration, and the extensions involved in instrumentation on the Wikipedias.

For example, the extension should be responsible for embedding the output of the app in a ResourceLoader config module. If it doesn’t, then the browser must make a request directly to the app to fetch instrument configuration, which would need signoff from SRE and Performance.
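
As a rough illustration of that embedding step, the sketch below turns the app's output into a compact JSON blob of the kind a config module could ship to the browser. It is written in Python for brevity; the real extension would be PHP and would use ResourceLoader's own mechanisms. The field names (`name`, `sample_rate`, `enabled`) and the function name are assumptions, not part of any existing API.

```python
import json
from typing import Dict, List


def build_config_payload(instruments: List[Dict]) -> str:
    """Serialize the enabled instruments into a compact JSON string.

    Only the fields the client is assumed to need are included, so that the
    payload delivered with every page view stays small.
    """
    payload = {
        "instruments": [
            {"name": inst["name"], "sample_rate": inst["sample_rate"]}
            for inst in instruments
            if inst.get("enabled", False)
        ]
    }
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


# Hypothetical app output: one enabled and one disabled instrument.
app_output = [
    {"name": "search-clicks", "sample_rate": 0.1, "enabled": True},
    {"name": "edit-funnel", "sample_rate": 1.0, "enabled": False},
]
print(build_config_payload(app_output))
```

Because the payload is built on the server, the browser makes no extra request to the app.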

Alternatively, we could build the bridge inside EventLogging, an extension that is already in production. Data Engineering owns EventLogging, so this would require some coordination with them. However, it would be less flexible in the long term.

Auth(n|z)


If we opt to build and deploy a MediaWiki extension, then authn and authz will be implemented using MediaWiki’s user rights and groups subsystem.

If we opt to deploy an app, then it must support OpenID and/or OAuth.

OpenID will allow us to authenticate and authorize users using their Wikitech account via CAS-SSO. This flow is familiar to users who have authenticated with various Wikimedia-hosted apps, e.g. Superset and Turnilo.

OAuth will allow us to authenticate and authorize users using their Wikitech account. This flow is familiar for users who have authorized tools to interact with wikis on their behalf. To authorize users, we must grant them a custom MediaWiki user right which is allowlisted in the app. Fortunately, we can define rights in the bridging extension.
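
The allowlist check on the app side could look something like the sketch below. The right name `manage-instruments` is invented for illustration, and how the app obtains the user's rights from the OAuth identity response is left out.

```python
from typing import FrozenSet, Iterable

# Hypothetical custom user right, which the bridging extension would define.
ALLOWED_RIGHTS: FrozenSet[str] = frozenset({"manage-instruments"})


def is_authorized(
    user_rights: Iterable[str],
    allowed: FrozenSet[str] = ALLOWED_RIGHTS,
) -> bool:
    """Return True if the user holds at least one allowlisted right."""
    return not allowed.isdisjoint(user_rights)


print(is_authorized(["read", "edit", "manage-instruments"]))  # prints True
print(is_authorized(["read", "edit"]))                        # prints False
```

Keeping the allowlist in the app's configuration means that access can be granted or revoked on-wiki, through the user right, without redeploying the app.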

Purview


If we opt to build and deploy a MediaWiki extension, then there will be a separate instance running on the Beta Cluster and on the production app servers. These instances will not be able to communicate with each other. The user will have to manually stitch together the history of an instrument.

If we instead opt to build an app, then we will have to make the app and the bridging extension environment-aware. For example, the bridging extension running on the Beta Cluster would be configured to request instrument configurations for the Beta Cluster, and likewise for production.
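
One way to picture environment awareness is a bridging extension that passes its environment along with every request to the app. In the sketch below, the base URL, the `/instruments` path, and the query parameter names are all hypothetical; no such API exists yet.

```python
from urllib.parse import urlencode

# Hypothetical values; the real app URL and environment names are undecided.
APP_BASE_URL = "https://configurator.example.org"
KNOWN_ENVIRONMENTS = frozenset({"beta", "production"})


def active_instruments_url(environment: str) -> str:
    """Build the URL the bridge would request for its own environment."""
    if environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    query = urlencode({"environment": environment, "status": "active"})
    return f"{APP_BASE_URL}/instruments?{query}"


print(active_instruments_url("beta"))
# prints https://configurator.example.org/instruments?environment=beta&status=active
```

With a single app serving both environments, the history of an instrument lives in one place instead of being split across separate instances.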

Flexibility


In the near future, Data Products will also be deciding where to build an equivalent experiment configurator app – either a third-party solution or our own. Opting to deploy an app leaves us in a better position to explore third-party solutions to experiment configuration later with as little follow-up work as possible.


Additional Comments


[Anyone can add anything that doesn’t neatly fit into the format above]

Was This Helpful?


If you just read this TDR, please let us know how to improve this template.

  • Did the TDR provide the information you were looking for?
  • How was it overdone?
  • How was it underdone?