Original Google Doc with Decision Record

What are your constraints?

General Assumptions and Requirements	Source The person or document that this requirement comes from.
We will not fully solve comprehensiveness all at once. Instead, we will aim to increase comprehensiveness by targeting a subset of important use cases. We will initially target: wikitext content (MCR?), wikitext diffs, HTML content, page links changes, Wikibase entity data, edit history of a user, page redirects info, and Wikimedia Enterprise needs.
This Decision Record specifically addresses getting more MediaWiki state into events. It does not include recommendations or solutions for building other event driven services. However it will make building services that use MediaWiki state in events possible.
Security Requirements Describe any security limitations or constraints for the proposed solutions.
We don’t currently have any access control over how engineers produce or consume event stream data. This will not be solved in this decision record, but we should be aware of this while solutioning.
Privacy Requirements Describe potential privacy limitations or constraints for the proposed solutions and how they will be mitigated.
Emitting state changes into streams means that those changes will persist immutably for the retention period of each stream. We should be careful with PII: (1) never expose it publicly, and (2) include a way to remove this data using Kafka compacted topics. Using compacted topics might be out of scope of this Decision Record, but should be considered.
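
To illustrate how compacted topics could support removing PII: in a compacted Kafka topic, the broker eventually keeps only the latest record per key, and a record with a null value (a tombstone) marks that key for deletion. The following is a minimal pure-Python simulation of that behavior, not Kafka client code. The record keys and the `user_text` field are hypothetical.

```python
from typing import Optional

# (key, value); a value of None is a tombstone.
Record = tuple[str, Optional[dict]]


def compact(log: list[Record]) -> list[Record]:
    """Simulate log compaction: keep the last record per key, drop tombstoned keys."""
    latest: dict[str, Optional[dict]] = {}
    for key, value in log:
        # A later record for the same key supersedes earlier ones.
        latest.pop(key, None)
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]


log = [
    ("user:1", {"user_text": "Alice"}),
    ("user:2", {"user_text": "Bob"}),
    ("user:1", {"user_text": "Alice (renamed)"}),
    ("user:2", None),  # tombstone: request removal of everything keyed by user:2
]
compacted = compact(log)
print(compacted)
```

This only works if events are keyed by the thing that must be removable (here a hypothetical user key), which is a stream design decision in its own right.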

Decision

Selected Option	Option 4: Streaming Service(s)
Rationale	Option 1. Do Nothing and Option 3. JobQueue can easily be ruled out. JobQueue was quickly dismissed by Petr as not reliable enough. Option 2. EventBus is a possibility, but really only if we only ever planned to add 1 or 2 new event streams. This leaves Option 4. Streaming Service(s). Doing this in a streaming service allows us to have a decoupled deployment from MediaWiki, and will be more flexible when adding new event streams. It also allows us to build more expertise and tooling for doing this for new products in the future.
Data	See Options below.
Informing	Andrew Otto and Luke Bowmaker will be informing and working with others on this. Our initial need for interfacing with others will be to vet any new event streams we design to be sure they meet potential needs.

Who	Andrew Otto
Date	2022-03-02

What are your options?

Option 1: Do Nothing
Description	Continue using the existing MediaWiki event streams for state transfer.
Benefits	No streaming services to maintain.
Risks	Teams that need state outside of MediaWiki will have to get it themselves, injecting latency and complexity into their data pipelines. Limits our ability to create timely and more relevant data products.
Effort	More for teams implementing services.

Option 2: MediaWiki directly produces more events
Description	MediaWiki directly emits more comprehensive event streams via EventBus extension.
Benefits	No streaming pipelines to maintain. Easy to do now.
Risks	Not clear that we can get all we need via existing MediaWiki hooks. Makes MediaWiki do much more on the app servers, which may increase load and/or latency in user interactions. May make solving the consistency problem more difficult. Produces to EventGate rather than Kafka directly: fewer consistency guarantees. No way to bootstrap using historical data.
Effort	If the risks are not considered, then this could be implemented in a quarter.
Costs
Performance & Scaling	Since MediaWiki itself will be producing more data at request time, we need to be very careful about what happens on MediaWiki app servers.
Deployment	Comprehensive events may be large; we need to be careful that Kafka can handle them.
Rollback and reversibility	Until there are active consumers of new streams, rollback is just as easy as any other code change.
Operations & Monitoring
Additional References
Consultations
Consulted party 1	Search Platform - Zbyszko Papierski
Consulted party 2	WMDE Wikibase and Wikidata - Leszek Manicki
Consulted party 3	Platform Engineering - Petr Pchelko
Consulted party 4	SRE - Giuseppe Lavagetto
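
The concern that comprehensive events may be large can be checked before producing. The following is a minimal sketch that measures the serialized size of an event against a configurable limit. The 1 MiB limit, the function names, and the event fields are illustrative assumptions, not values taken from any actual Kafka or EventGate configuration.

```python
import json

# Hypothetical limit for illustration; the real value depends on broker and
# producer settings, which this record does not specify.
MAX_EVENT_BYTES = 1024 * 1024


def serialized_size(event: dict) -> int:
    """Return the size in bytes of the event encoded as UTF-8 JSON."""
    return len(json.dumps(event, ensure_ascii=False).encode("utf-8"))


def fits_in_stream(event: dict, max_bytes: int = MAX_EVENT_BYTES) -> bool:
    """True if the serialized event is within the size limit."""
    return serialized_size(event) <= max_bytes


small = {"rev_id": 1, "content": "short wikitext"}
large = {"rev_id": 2, "content": "x" * (4 * 1024 * 1024)}  # about 4 MB of wikitext

print(fits_in_stream(small), fits_in_stream(large))
```

What to do with an oversized event (drop it, truncate it, or emit a reference instead of the content) is left open here.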

Option 3: Change-Prop / MW Job Queue produces more events
Description	Create new MW jobs to react to existing MediaWiki notification events (e.g. revision-create) to produce new MediaWiki event streams (e.g. wikitext or html revision content).
Benefits	MW Job queue exists and is maintained by Platform Eng and SRE.
Risks	Likely requires the MW job to access the MariaDB database to get data. Can only react to one event at a time. Produces to EventGate rather than Kafka directly: fewer consistency guarantees. Will add load to MW job servers. No way to bootstrap using historical data. Jobs are delayed and can be lost.
Effort	2 quarters?
Costs	Maintenance of new jobs. Will need to be owned by an engineering team.
Performance & Scaling	Need to be careful about requesting too much from the MediaWiki MariaDB (especially if we have to bootstrap a stream with historical data). Comprehensive events may be large; we need to be careful that Kafka can handle them.
Deployment	Jobs will only be run in active DC(?)
Rollback and reversibility	Reversible until someone starts consuming the streams it produces.
Operations & Monitoring	Latency of events (how long does it take between a revision create in MW and a new revision content event to be produced)
Additional References
Consultations
Consulted party 1	SRE Data Persistence - Manuel Arostegui
Consulted party 2	WMDE Wikibase and Wikidata - Leszek Manicki
Consulted party 3	Platform Engineering - Petr Pchelko
Consulted party 4	SRE - Giuseppe Lavagetto
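
The latency metric named under Operations & Monitoring (how long it takes between a revision create in MW and the new revision content event being produced) could be computed as in the sketch below. The parameter names and the 60-second threshold are hypothetical, and "late" here simply means latency above that threshold, which is a simplification.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: events slower than this are counted as late.
LATE_THRESHOLD = timedelta(seconds=60)


def event_latency(rev_timestamp: str, produced_at: str) -> timedelta:
    """Time between the revision being created and the derived event being produced."""
    created = datetime.fromisoformat(rev_timestamp)
    produced = datetime.fromisoformat(produced_at)
    return produced - created


def is_late(latency: timedelta, threshold: timedelta = LATE_THRESHOLD) -> bool:
    """True if the event took longer than the threshold to be produced."""
    return latency > threshold


latency = event_latency("2022-03-02T10:00:00+00:00", "2022-03-02T10:00:05+00:00")
print(latency.total_seconds(), is_late(latency))
```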

Option 4: Streaming service produces more events
Description	Streaming service(s) react to existing MediaWiki notification events (e.g. revision-create) and ask the MediaWiki API for more data (e.g. wikitext or html revision content) and produce new event streams. Tech TBD, but could be Flink, Kafka Streams, Knative Eventing, etc.
Benefits	Independent from MediaWiki monolith (only coupled via the API). Easy to add new data and streams once we have a baseline service implemented. Produces to Kafka directly: more consistency guarantees. If needed, possible to get data from data sources other than MW to include in the event. We will need to build more streaming apps in the future; doing this builds expertise and tooling to support that.
Risks	Not clear if we can get all we need from the MediaWiki API. Operating streaming services is new for us (Search Platform has experience now).
Effort	2 or 3 quarters to get the initial service in production. After that, minimal effort to add more data streams.
Costs	Maintenance of streaming service(s). Will need to be owned by an engineering team.
Performance & Scaling	Need to be careful about requesting too much from the MediaWiki API (especially if we have to bootstrap a stream with historical data). Comprehensive events may be large; we need to be careful that Kafka can handle them.
Deployment	Multi-datacenter deployments of streaming pipelines are complicated. Search Platform has settled on a pattern (active-active with multi-DC compute). We may choose a different pattern here, since the existing MW events are not multi-compute.
Rollback and reversibility	Since this is a separate service, it is reversible until someone starts consuming the streams it produces.
Operations & Monitoring	Stream throughput. Latency of events (how long does it take between a revision create in MW and a new revision content event to be produced). Late events.
Additional References	Data Platform - Event Driven Services
Consultations
Consulted party 1	Search Platform - Zbyszko Papierski
Consulted party 2	WMDE Wikibase and Wikidata - Leszek Manicki
Consulted party 3	Platform Engineering - Petr Pchelko
Consulted party 4	SRE - Giuseppe Lavagetto
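
The core step of Option 4 (react to a notification event, ask the MediaWiki API for more data, produce a new event) can be sketched as a small enrichment function. This is a sketch under stated assumptions, not the chosen implementation: the event fields (`database`, `rev_id`, `page_title`), the function names, and the output shape are hypothetical, and the API request is replaced by an injected callable so that no network or Kafka access is needed.

```python
from typing import Callable

# Stand-in for a MediaWiki API request: (wiki database, revision id) -> wikitext.
ContentFetcher = Callable[[str, int], str]


def enrich_revision_create(event: dict, fetch_content: ContentFetcher) -> dict:
    """Build a revision content event from a revision-create notification event."""
    content = fetch_content(event["database"], event["rev_id"])
    # Copy the original fields so the notification event is not mutated.
    enriched = dict(event)
    enriched["content"] = content
    return enriched


def fake_fetcher(database: str, rev_id: int) -> str:
    """Hypothetical fetcher used here instead of a network call."""
    return f"wikitext of revision {rev_id} on {database}"


notification = {"database": "enwiki", "rev_id": 123, "page_title": "Example"}
content_event = enrich_revision_create(notification, fake_fetcher)
print(content_event["content"])
```

Keeping the fetch behind a callable means the same function could be reused whichever streaming technology is picked, and could take data from sources other than MW.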

Resource:

https://www.atlassian.com/blog/inside-atlassian/make-team-decisions-without-killing-momentum

Use Cases and required MediaWiki state events

This table is meant to show that, even though we could solve a lot of comprehensiveness problems with Option 2 (EventBus), that still leaves a gap in using that data to compute something new. We likely need a centralized platform to compute new datasets.

Project	What additional data was needed that streams didn’t have?	Which option would have solved this problem?	Why didn’t it use streams?	What did it use?	What was the impact of not using streams to implement?
Image Suggestions - suggests an image from Commons if an article doesn’t have one	Images linked to article	4; 2 - could only help with additional data requests, which are fairly small	Additional data is easy to get from MW API but requires an event compute component to run algo that wasn’t easily available	Scheduled monthly batch job	Data only refreshed monthly so dataset can be stale or missing for up to a month
WikiWho (T293386) - assigns ownership to each word of article and revisions	Revision diffs	4; 2 - could only help with additional data requests but needs	It does but it’s built outside WMF using RabbitMQ. Is being brought in house using combo of WCS	Copy over of systems used by community - https://phabricator.wikimedia.org/F34639572	WMF increased technical debt for components we may not support (Python pickles, Postgres, etc)
Sections - ML/AI model to define sections of an article	Wikitext?	2 - if LiftWing could listen directly to a new stream; 4 - may be needed to transform data/store as we want	Project in early phases so still might	N/A	If we rely on monthly data then dataset can be stale or missing for up to a month
Similar Users	Edit history of user?	Would it be too much for 2 to provide this? 4 - could solve with MW API call and then compute part for algo	Startup costs of event platform like Flink are high for one project	Scheduled monthly batch job	Data only refreshed monthly so dataset can be stale or missing for up to a month
Enterprise?
Wikidata Query Service	Well ordered diffs of the RDF data	4 - needs some way to hold state so updates can be ordered	It did but took a lot of effort to get going	It used Flink streams	N/A
Search updates (The search platform is currently thinking of a possible rewrite using other technologies)	Content (wikitext + html), redirects information (perhaps more, still exploring)	4 - as we probably want to batch updates but also join multiple streams (pageviews data, ores scores, …)	Current setup is written inside MW, this system has the largest footprint on the jobqueue	MW JobQueue	The current JobQueue cannot handle the load induced by CirrusSearch due to its design

Decision Record Drafting Meeting Notes

2022-03-01: Otto and Giuseppe discussion:

Possible we want to enrich events with stuff that might come from other places than MW.
Want to free MW app worker as soon as possible.
1 or 100 more API requests per edit is still okay.
Stream processing approach is the more long term sustainable one.
BUT, if you want something here and now for some short-term goal, EventBus is okay. Worry: that thing will remain there forever. Don’t want to maintain both forever.
Need to make developing services around the big thing easier. They tend to want to store the data in a Docker image now.
Preference for stream processing over EventBus.

Feb 15, 2022 | petr & otto discussion

Attendees: Petr Pchelko, Andrew Otto, Dan Andreescu

Notes

PP: we should dismiss the job queue idea. Worst of both worlds. Still in PHP, but jobs are delayed and can get lost. All downsides.
Making just content events might be ok in eventbus. But if we have 500 new events, maintaining in MW might be difficult.
PP: What about consistency?
DA: perhaps Debezium on just the content table for content events. Rev_id and content, that’s it. This should be a considered solution.
PP: then we could generalize it: when MW table schema is ‘reasonable’ we could just use Debezium for other things too. When not reasonable, use EventBus.
AO: people also will want html content, and page links changes.
PP: maybe sending 4 MB of content and 12 MB of HTML on every edit in a PHP deferred update (EventBus) isn’t great.
PP: my preferred solution: start with EventBus, then do a separate streaming service. If fat events get traction and we need more and more, then we do the streaming service solution.
DA: would be easiest now, but what about the performance of producing all that data from the app servers after an edit?
What would Giuseppe say? Will this bog down app servers?

Action items

Talk to SRE about emitting from EventBus, if okay with them, let’s do it.
However, if this needs to emit many different kinds of events, then maybe doing it in EventBus is not that flexible and we should do streaming service anyway.
Talk with Giuseppe: he prefers streaming service idea. Doing this in EventBus will likely just be tech debt.

Feb 14, 2022 | Discuss Comprehensive MediaWiki Events Decision Record

Attendees: Luke Bowmaker, Andrew Otto, Petr Pchelko, Leszek Manicki, David Causse, Andy Craze

Note

https://libwas.readthedocs.io/en/latest/
What MW state would be most useful to have in streams now?
- Wikitext content
- Wikitext diffs
- Html content
- Page links changes
- Wikibase entity data
- Citation changes? (is this different than links?)

AC: ORES preprocessing for models?
- Most are just fetching article text or diff.
- Every ores model is at the revision level, text and diffs most useful
- In the future, lots of things we can do, depends on use case.

LM: From wikidata/wikibase
- Could be rubbish!? :)
- Wikidata edits are slow sometimes because of AbuseFilter. Could we build this functionality outside of the request pipeline?
- AbuseFilter: Community can set up their own filters, which can slow things down. This is done before page save.
DC: Redirects? These are separate from pages. When a redirect is added to a page, we would like to have an event for this. Consider page as an object with its redirects.
- Existing events have page_is_redirect flag. We could put where the redirect is to by asking MW.
- Other side too. What pages redirect TO a page? Page A is redirected from Pages X, Y, Z.
- PP: redirect sources are stored in a denormalized table, I think page links.
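
The idea of treating a page as an object with its redirects can be sketched as a small index that answers "what pages redirect TO a page?" from a sequence of redirect changes. This is an illustrative sketch only; the class, its methods, and the convention that a target of `None` means the redirect was removed are all hypothetical.

```python
from collections import defaultdict
from typing import Optional


class RedirectIndex:
    """Tracks which pages redirect to each target page."""

    def __init__(self) -> None:
        self._target_of = {}                 # source page -> current target
        self._sources_of = defaultdict(set)  # target page -> set of sources

    def apply(self, source: str, target: Optional[str]) -> None:
        """Apply a redirect change; target=None means the redirect was removed."""
        old = self._target_of.pop(source, None)
        if old is not None:
            self._sources_of[old].discard(source)
        if target is not None:
            self._target_of[source] = target
            self._sources_of[target].add(source)

    def redirects_to(self, target: str) -> set:
        """Return a copy of the set of pages currently redirecting to target."""
        return set(self._sources_of.get(target, set()))


index = RedirectIndex()
index.apply("Page X", "Page A")
index.apply("Page Y", "Page A")
index.apply("Page X", None)  # redirect removed
print(index.redirects_to("Page A"))
```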

Solution discussion:

MW Job Queue vs Stream Processor
PP: page content is immutable. You can attach it to whatever event at any time in the future. Async is okay here, it will be correct. Doesn’t really matter if job queue or not.
- Option 2: at request time (EventBus) is actually okay.
- MW Job doesn’t really add much for us. It’s just more async. Just adding a step that doesn’t really give you anything.
- Option 4 is cool, especially if MySQL External Store had its own API separate from MW.
- Option 4 isn’t really decoupled; it’s a separate deployment unit, that’s something. But is it worth it?
- AO: Option 2 and 3 have to POST to EventGate.
  - PP: there are maybe ok PHP Kafka producers now?
- PP: There are a ton of things that are coded in MW PHP. Having to recode that in other languages is annoying. E.g. MW normalizing page titles.
- PP: What are you getting from doing this, versus just having all consumers ask the API for what they need?
- DC: reading directly from MW events: ordering is hard to accomplish. Reading multiple topics. Streaming processor helps, but it is complicated.
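
The ordering problem with reading multiple topics can be illustrated with a small merge by event time. This sketch assumes each topic is already ordered and bounded, which real streams are not; a stream processor would need buffering and watermarks to approximate the same result. The topic names and the `ts` and `page_id` fields are hypothetical.

```python
import heapq
from operator import itemgetter

# Each list stands in for one topic, already ordered by event time.
page_create = [
    {"ts": 1, "topic": "page-create", "page_id": 7},
    {"ts": 5, "topic": "page-create", "page_id": 8},
]
revision_create = [
    {"ts": 2, "topic": "revision-create", "page_id": 7},
    {"ts": 3, "topic": "revision-create", "page_id": 7},
    {"ts": 6, "topic": "revision-create", "page_id": 8},
]

# heapq.merge lazily interleaves sorted inputs into one sorted stream.
merged = list(heapq.merge(page_create, revision_create, key=itemgetter("ts")))
print([(e["ts"], e["topic"]) for e in merged])
```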