Technical decision making/Decision records/T291120
Original Google Doc with Decision Record
What are your constraints?
General Assumptions and Requirements | Source
The person or document that this requirement comes from. |
We will not fully solve comprehensiveness all at once. Instead, we will aim to increase comprehensiveness by targeting a subset of important use cases. We will initially target:
|
|
This Decision Record specifically addresses getting more MediaWiki state into events. It does not include recommendations or solutions for building other event driven services. However it will make building services that use MediaWiki state in events possible. | |
Security Requirements
Describe any security limitations or constraints for the proposed solutions. |
|
We don’t currently have any access control over how engineers produce or consume event stream data. This will not be solved in this decision record, but we should be aware of this while solutioning. | |
Privacy Requirements
Describe potential privacy limitations or constraints for the proposed solutions and how they will be mitigated. |
|
Emitting state changes into streams means that those changes will persist immutably for the retention period of each stream. We should be careful with PII: (1) never expose it publicly, and (2) include a way to remove this data using Kafka compacted topics. Using compacted topics might be out of scope of this Decision Record, but should be considered. |
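As a rough illustration of why compacted topics help with removal: in a compacted Kafka topic, the broker eventually keeps only the latest record per key, and a record with a null value (a tombstone) marks that key for deletion. The sketch below is a simplified in-memory model of that end state, not Kafka client code. The keys and payloads (`user:1`, `name`) are hypothetical, and real compaction runs asynchronously, so older records stay readable for some time.

```python
from typing import Dict, Iterable, Optional, Tuple

# A log record is (key, value); a value of None models a Kafka tombstone.
Record = Tuple[str, Optional[dict]]


def compact(log: Iterable[Record]) -> Dict[str, dict]:
    """Model the end state of a fully compacted topic.

    Keeps the latest value per key and drops keys whose latest
    record is a tombstone.
    """
    latest: Dict[str, dict] = {}
    for key, value in log:
        if value is None:
            # Tombstone: the key and its earlier values are removed.
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest


# Hypothetical log: user:2 is later removed with a tombstone.
log = [
    ("user:1", {"name": "A"}),
    ("user:2", {"name": "B"}),
    ("user:1", {"name": "A2"}),
    ("user:2", None),
]
remaining = compact(log)
```

Here `remaining` holds only the latest value for `user:1`. This only works if events carrying PII are keyed by the entity that may later need removal.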
Decision
Selected Option | Option 4: Streaming Service(s) |
Rationale | Option 1. Do Nothing and Option 3. JobQueue can easily be ruled out. JobQueue was quickly dismissed by Petr as not reliable enough.
Option 2. EventBus is a possibility, but really only if we only ever planned to add 1 or 2 new event streams. This leaves Option 4. Streaming Service(s). Doing this in a streaming service allows us to have a decoupled deployment from MediaWiki, and will be more flexible when adding new event streams. It also allows us to build more expertise and tooling for doing this for new products in the future. |
Data | See Options below. |
Informing | Andrew Otto and Luke Bowmaker will be informing and working with others on this. Our initial need for interfacing with others will be to vet any new event streams we design to be sure they meet potential needs. |
Who | Andrew Otto |
Date | 2022-03-02 |
What are your options?
Option 1: Do Nothing | |
Description | Continue using the existing MediaWiki event streams for state transfer. |
Benefits | No streaming services to maintain. |
Risks | Teams that need state outside of MediaWiki will have to get it themselves, injecting latency and complexity into their data pipelines.
Limits our ability to create timely and more relevant data products. |
Effort | More for teams implementing services. |
Option 2: MediaWiki directly produces more events | |
Description | MediaWiki directly emits more comprehensive event streams via EventBus extension. |
Benefits |
|
Risks |
|
Effort |
|
Costs | |
Testing | |
---|---|
Performance & Scaling | Since MediaWiki itself will be producing more data at request time, we need to be very careful about what happens on MediaWiki app servers. |
Deployment |
|
Rollback and reversibility | Until there are active consumers of new streams, rollback is just as easy as any other code change. |
Operations & Monitoring | |
Additional References | |
Consultations | |
Consulted party 1 | Search Platform - Zbyszko Papierski |
Consulted party 2 | WMDE Wikibase and Wikidata - Leszek Manicki |
Consulted party 3 | Platform Engineering - Petr Pchelko |
Consulted party 4 | SRE - Giuseppe Lavagetto |
Option 3: Change-Prop / MW Job Queue produces more events | |
Description | Create new MW jobs that react to existing MediaWiki notification events (e.g. revision-create) and produce new MediaWiki event streams (e.g. wikitext or html revision content). |
Benefits |
|
Risks |
|
Effort | 2 quarters? |
Costs | Maintenance of new jobs. Will need to be owned by an engineering team. |
Testing | |
---|---|
Performance & Scaling |
|
Deployment | Jobs will only be run in active DC(?) |
Rollback and reversibility | Reversible until someone starts consuming the streams it produces. |
Operations & Monitoring |
|
Additional References | |
Consultations | |
Consulted party 1 | SRE Data Persistence - Manuel Arostegui |
Consulted party 2 | WMDE Wikibase and Wikidata - Leszek Manicki |
Consulted party 3 | Platform Engineering - Petr Pchelko |
Consulted party 4 | SRE - Giuseppe Lavagetto |
Option 4: Streaming service produces more events | |
Description | Streaming service(s) react to existing MediaWiki notification events (e.g. revision-create), ask the MediaWiki API for more data (e.g. wikitext or html revision content), and produce new event streams. Tech TBD, but could be Flink, Kafka Streams, KNative eventing, etc. |
Benefits |
|
Risks |
|
Effort | 2 or 3 quarters to get the initial service in production. After that, minimal effort to add more data streams. |
Costs | Maintenance of streaming service(s). Will need to be owned by an engineering team. |
Testing | |
---|---|
Performance & Scaling |
|
Deployment | Multi-datacenter deployments of streaming pipelines are complicated. Search Platform has settled on a pattern (active-active with multi-DC compute). We may choose a different pattern here, since the existing MW events are not multi-compute. |
Rollback and reversibility | Since this is a separate service, it is reversible until someone starts consuming the streams it produces. |
Operations & Monitoring |
|
Additional References | Data Platform - Event Driven Services |
Consultations | |
Consulted party 1 | Search Platform - Zbyszko Papierski |
Consulted party 2 | WMDE Wikibase and Wikidata - Leszek Manicki |
Consulted party 3 | Platform Engineering - Petr Pchelko |
Consulted party 4 | SRE - Giuseppe Lavagetto |
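The enrichment pattern described for Option 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the selected implementation: the event fields (`database`, `rev_id`, `page_title`), the output field `content_body`, and the `fetch_wikitext` callable are all hypothetical, and the stand-in fetcher replaces what would be an HTTP request to the MediaWiki API in a real service.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical event shape: a dict with "database", "rev_id" and "page_title".
Event = dict


def enrich_with_content(
    events: Iterable[Event],
    fetch_wikitext: Callable[[str, int], str],
) -> Iterator[Event]:
    """Yield a copy of each revision-create event with content attached."""
    for event in events:
        content = fetch_wikitext(event["database"], event["rev_id"])
        # Copy rather than mutate, so the source event stays untouched.
        yield {**event, "content_body": content}


def fake_fetch(database: str, rev_id: int) -> str:
    """Stand-in for a MediaWiki API call (assumption for this sketch)."""
    return f"wikitext of {database} revision {rev_id}"


source = [{"database": "enwiki", "rev_id": 123, "page_title": "Example"}]
enriched = list(enrich_with_content(source, fake_fetch))
```

Passing the fetcher in as a parameter keeps the enrichment step independent of how content is retrieved, which is the decoupling from MediaWiki that this option aims for.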
Resource:
https://www.atlassian.com/blog/inside-atlassian/make-team-decisions-without-killing-momentum
Use Cases and required MediaWiki state events
This table shows that even though Option 2 (EventBus) could solve many comprehensiveness problems, it still leaves a gap: using that data to compute something new. We likely need a centralized platform to compute new datasets.
Project | What additional data was needed that streams didn’t have? | Which option would have solved this problem? | Why didn’t it use streams? | What did it use? | What was impact of not using streams to implement |
---|---|---|---|---|---|
Image Suggestions - suggests an image from Commons if an article doesn’t have one | Images linked to article | 4; 2 - could only help with additional data requests which are fairly small | Additional data is easy to get from MW API but requires an event compute component to run algo that wasn’t easily available | Scheduled monthly batch job | Data only refreshed monthly so dataset can be stale or missing for up to a month |
WikiWho (T293386) - assigns ownership to each word of article and revisions | Revision diffs | 4; 2 - could only help with additional data requests but needs | It does but it’s built outside WMF using RabbitMQ. Is being brought in house using combo of WCS | Copy over of systems used by community - https://phabricator.wikimedia.org/F34639572 | WMF increased technical debt for components we may not support (Python pickles, Postgres, etc) |
Sections - ML/AI model to define sections of an article | Wikitext? | 2 - if LiftWing could listen directly to a new stream; 4 - may be needed to transform data/store as we want | Project in early phases so still might | N/A | If we rely on monthly data then dataset can be stale or missing for up to a month |
Similar Users | Edit history of user? | Would it be too much for 2 to provide this?; 4 - could solve with MW API call and then compute part for algo | Startup costs of event platform like Flink are high for one project | Scheduled monthly batch job | Data only refreshed monthly so dataset can be stale or missing for up to a month |
Enterprise? | |||||
Wikidata Query Service | Well ordered diffs of the RDF data | 4 - needs some way to hold state so updates can be ordered | It did but took a lot of effort to get going | It used Flink streams | N/A |
Search updates (the Search Platform team is currently considering a possible rewrite using other technologies) | Content (wikitext + html), redirects information (perhaps more, still exploring) | 4 - as we probably want to batch updates but also join multiple streams (pageviews data, ORES scores, …) | Current setup is written inside MW; this system has the largest footprint on the JobQueue | MW JobQueue | The current JobQueue cannot handle the load induced by CirrusSearch due to its design |
Decision Record Drafting Meeting Notes
2022-03-01: Otto and Giuseppe discussion:
- Possible we want to enrich events with stuff that might come from other places than MW.
- Want to free MW app worker as soon as possible.
- 1 or 100 more API requests per edit is still okay.
- Stream processing approach is the more long term sustainable one.
- BUT, if you want something here and now for some short term goal. EventBus okay. Worry: that thing will remain there forever. Don’t want to maintain both forever.
- Need to make developing services around the big thing easier. They tend to want to store the data in docker image now.
- Preference for stream processing over eventbus.
Feb 15, 2022 | Petr & Otto discussion
Attendees: Petr Pchelko, Andrew Otto, Dan Andreescu
Notes
- PP: we should dismiss the job queue idea. Worst of both worlds. Still in PHP, but jobs are delayed and can get lost. All downsides.
- Making just content events might be ok in eventbus. But if we have 500 new events, maintaining in MW might be difficult.
- PP: What about consistency?
- DA: perhaps Debezium on just the content table for content events. Rev_id and content, that’s it. This should be a considered solution.
- PP: then we could generalize it: when MW table schema is ‘reasonable’ we could just use Debezium for other things too. When not reasonable, use EventBus.
- AO: people also will want html content, and page links changes.
- PP: maybe sending 4 MB of content and 12 MB of HTML on every edit in a PHP deferred update (EventBus) isn’t great.
- PP: my preferred solution: start with EventBus, then do a separate streaming service. If fat events get traction and we need more and more, then we do the streaming service solution.
- DA: would be easiest now, but what about the performance of producing all that data from the app servers after an edit?
- What would Giuseppe say? Will this bog down app servers?
Action items
- Talk to SRE about emitting from EventBus, if okay with them, let’s do it.
- However, if this needs to emit many different kinds of events, then maybe doing it in EventBus is not that flexible and we should do streaming service anyway.
- Talk with Giuseppe: he prefers the streaming service idea. Doing this in EventBus will likely just be tech debt.
Feb 14, 2022 | Discuss Comprehensive MediaWiki Events Decision Record
Attendees: Luke Bowmaker, Andrew Otto, Petr Pchelko, Leszek Manicki, David Causse, Andy Craze
Note
- https://libwas.readthedocs.io/en/latest/
- What MW state would be most useful to have in streams now?
- Wikitext content
- Wikitext diffs
- Html content
- Page links changes
- Wikibase entity data
- Citation changes? (is this different than links?)
- AC: ORES preprocessing for models?
- Most are just fetching article text or diff.
- Every ORES model is at the revision level; text and diffs are most useful.
- In the future, lots of things we can do, depends on use case.
- LM: From wikidata/wikibase
- Could be rubbish!? :)
- Wikidata edits are slow sometimes because of AbuseFilter. Could we build this functionality outside of the request pipeline?
- AbuseFilter: Community can set up their own filters, which can slow things down. This is done before page save.
- DC: Redirects? These are separate from pages. When a redirect is added to a page, we would like to have an event for this. Consider page as an object with its redirects.
- Existing events have page_is_redirect flag. We could put where the redirect is to by asking MW.
- Other side too. What pages redirect TO a page? Page A is redirected from Pages X, Y, Z.
- PP: redirect sources are stored in a denormalized table, I think page links.
Solution discussion:
- MW Job Queue vs Stream Processor
- PP: page content is immutable. You can attach it to whatever event at any time in the future. Async is okay here, it will be correct. Doesn’t really matter if job queue or not.
- Option 2: at request time (EventBus) is actually okay.
- MW Job doesn’t really add much for us. It’s just more async. Just adding a step that doesn’t really give you anything.
- Option 4 is cool, especially if MySQL External Store had its own API separate from MW.
- Option 4 isn’t really decoupled; it’s a separate deployment unit, that’s something. But is it worth it?
- AO: Option 2 and 3 have to POST to EventGate.
- PP: there are maybe ok PHP Kafka producers now?
- PP: There are a ton of things that are coded in MW PHP. Having to recode that in other languages is annoying. E.g. MW normalizing page titles.
- PP: What are you getting from doing this from just having all consumers asking API for what they need?
- DC: reading directly from MW events: ordering is hard to accomplish. Reading multiple topics. Streaming processor helps, but it is complicated.
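
To illustrate the ordering problem DC raises, here is a minimal sketch of merging several topics into one time-ordered sequence. It assumes each topic is already ordered by a `dt` field holding ISO 8601 timestamps in the same format and timezone (so they compare correctly as strings); the field name and the sample events are assumptions for this example. It only works on bounded, already-sorted inputs. With unbounded streams and late-arriving events, a consumer would need buffering or watermarks, which is the part a stream processor provides.

```python
import heapq
from operator import itemgetter
from typing import Iterable, Iterator


def merge_by_time(*topics: Iterable[dict]) -> Iterator[dict]:
    """Merge per-topic ordered event sequences into one ordered sequence.

    Each input must already be sorted by its "dt" field.
    """
    return heapq.merge(*topics, key=itemgetter("dt"))


# Hypothetical events from two topics.
revision_create = [
    {"dt": "2022-02-14T10:00:00Z", "type": "revision-create"},
    {"dt": "2022-02-14T10:00:05Z", "type": "revision-create"},
]
page_move = [
    {"dt": "2022-02-14T10:00:02Z", "type": "page-move"},
]

merged_types = [e["type"] for e in merge_by_time(revision_create, page_move)]
```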