Wikimedia Apps/Team/Android/Machine Assisted Article Descriptions/Experiment Background

Experiment background

The Android team is partnering with Research and EPFL to improve article short descriptions (also known simply as short descriptions). The machine-generated candidate descriptions discussed below are occasionally referred to as "beams," after the beam search decoding used to generate them.

Currently, Android app users can create and edit article short descriptions via Suggested Edits. Article short descriptions are stored on Wikidata, with the exception of English Wikipedia, where they are stored locally. The Android team has received feedback that new users produce low-quality article short descriptions (T279702). In 2022, the team placed a temporary restriction on Suggested Edits for English Wikipedia users with fewer than 3 edits (T304621), with the intent of finding methods of improving the quality of article short descriptions written by new users.

EPFL and Research reached out to the Android team with a model called Descartes that can generate descriptions on par with those written by human editors. Descartes takes the information on a Wikipedia article page and produces a short description of the article while adhering to the guidance on what makes an article description helpful. During initial evaluation of the model, its output was preferred more than 50% of the time over human-generated article short descriptions, and Descartes held a 91.3% accuracy rate in testing. Despite these very promising results, the team wanted to do its due diligence by conducting an ABC test to ensure the suggestions improve the quality of article short descriptions when offered to new editors, without introducing new bias or increasing existing bias. We created an API, hosted on Toolforge, and will integrate the model into our existing interface in order to conduct our experiment. We will patrol edits made through the experiment in partnership with volunteers so as not to burden patrollers.
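
As an illustration of how the app might call that API, here is a minimal Kotlin sketch. The endpoint path, query parameters, and response shape are assumptions made for the example, not the documented interface of the Toolforge service:

  import java.net.URI
  import java.net.URLEncoder
  import java.net.http.HttpClient
  import java.net.http.HttpRequest
  import java.net.http.HttpResponse

  // Hypothetical client for the Toolforge-hosted Descartes API; the real
  // endpoint and parameter names may differ.
  fun fetchSuggestedDescriptions(lang: String, title: String, numBeams: Int = 2): String {
      val encodedTitle = URLEncoder.encode(title, Charsets.UTF_8)
      val uri = URI.create(
          "https://ml-article-descriptions.toolforge.org/article" +
              "?lang=$lang&title=$encodedTitle&num_beams=$numBeams"
      )
      val request = HttpRequest.newBuilder(uri).GET().build()
      val response = HttpClient.newHttpClient()
          .send(request, HttpResponse.BodyHandlers.ofString())
      return response.body() // JSON containing the candidate descriptions ("beams")
  }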

Product requirements

  • Allow users to provide feedback on individual suggestions should they detect issues
  • Accommodate two machine-generated suggestions to test which beam is more accurate
  • Onboard users to machine-generated suggestions
  • Show reminder popups about checking for bias when a user clicks a suggestion on a biography
  • Show suggestions for biographies only to experienced users
  • Allow users to write their own response or edit a suggestion
  • Incorporate an icon that identifies that the product uses machine learning
  • Multilingual compatibility via mBART-25

Objective and indicators

As a first step in the implementation of this project, the Android team will develop an MVP in order to:

  1. Determine whether suggestions made through the Descartes model increase the quality of article description additions and edits made using the Wikipedia Android app. To understand how suggested article descriptions change user behavior, we will evaluate:
    • Whether the introduction of suggestions alters the stickiness of the task type across editing tenure
    • Variability in task completion time relative to the quality of edits
    • How often users modify suggestions before hitting publish
    • The optimal design and user workflow to encourage accuracy and task retention
    • What additional measures, if any, need to be in place to discourage bad or biased suggestions
  2. Determine whether the algorithm holds up when exposed to more users:
    • Do the accuracy and preference rates change when the model is exposed to more users?
    • Do the accuracy and preference rates vary greatly across languages?
    • Does the algorithm introduce bias (e.g. misgendering) or fail to accurately represent critical nuance in Biographies of Living Persons?
    • How do the accuracy rate and performance change when more than one suggestion is shown?

Should the 30-day experiment show promising results based on the indicators above, the team will introduce the feature to all users and remove our 3-edit requirement for Suggested Edits. We will also take steps to expand the number of supported languages with mBART-50 and migrate the API from Toolforge to a more permanent home.

Volunteer Graders

The team will partner with volunteers who will patrol edits made during the experiment and assign a grade to each edit.

This will serve as one input for determining whether the quality of edits increases when machine-generated article short descriptions are used. Volunteer graders can sign up below or reach out to ARamadan-WMF.

The commitment for serving as a volunteer grader is up to one hour a week for four weeks.

Decisions to be made

This A/B test will help us decide whether to:

  • Expand the feature to all users
  • Use suggestions as a means to train new users, and remove the 3-edit minimum gate
  • Migrate the model to a more permanent API
  • Show 1 or 2 beams
  • Expand to mBART-50

ABC Logic Explanation

  • The experiment will include only logged-in users, in order to stabilize distribution.
  • Only users on mBART-25 wikis will see the suggestions.
  • Of those on mBART-25 wikis, half will see suggestions (B: Treatment) and half will not (A: Control).
  • Within the treatment group, only users with more than 50 edits can see suggestions for Biographies of Living Persons; users in the non-BLP group will remain in it even if they cross 50 edits during the experiment.

Additionally, we care about how the answers to our experiment questions will differ by language wiki and by user experience (fewer than 50 edits: New vs. 50 or more edits: Experienced).
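
To make the bucketing above concrete, the following Kotlin sketch expresses the same rules; the type and function names are illustrative assumptions, not the app's actual implementation:

  enum class Group { NOT_IN_EXPERIMENT, CONTROL, TREATMENT }

  data class ExperimentUser(
      val isLoggedIn: Boolean,
      val onMBart25Wiki: Boolean,
      val editCountAtEnrollment: Int, // frozen at enrollment, so group assignment stays sticky
      val stableBucket: Int           // deterministic per-user value used for the 50/50 split
  )

  fun assignGroup(user: ExperimentUser): Group = when {
      !user.isLoggedIn || !user.onMBart25Wiki -> Group.NOT_IN_EXPERIMENT
      user.stableBucket % 2 == 0 -> Group.TREATMENT
      else -> Group.CONTROL
  }

  // Suggestions on Biographies of Living Persons are limited to users who had
  // more than 50 edits when they entered the experiment.
  fun canSeeBlpSuggestions(user: ExperimentUser, group: Group): Boolean =
      group == Group.TREATMENT && user.editCountAtEnrollment > 50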

Decision criteria

  • If the accuracy rate for edits that came from the suggestion is lower than that of manually written edits, we will not keep the feature in the app. The accuracy rate will be determined through manual patrolling.
  • If the accuracy rate for edits that came from the suggestion is less than 80%, we will not keep the feature in the app. The accuracy rate will be determined through manual patrolling.
  • If the time spent completing the task with a suggestion is double the average of those who do not see suggestions, we will compare it against performance reports to see if there are performance issues.
  • If the time spent completing the task with a suggestion is less than the average, without a negative impact on the accuracy rate, we will consider it a positive indicator for expanding the feature to more users.
  • If users who see suggestions modify them more often than submitting them without modification, we will evaluate their accuracy rate compared to that of users who did not see suggestions, determine whether the suggestion is a good starting point for users, and examine how this differs by user experience.
  • If users who see suggestions modify them more often than submitting them without modification, we will also look for trends in the modifications and offer a recommendation to EPFL to update the model.
  • If beam one is chosen more than 25% more often than beam two while having an equal or higher accuracy rate, we will only show beam one in the future.
  • If users who see the treatment return to the task multiple times (1, 2, 7, 14 days) at a rate 15% or more higher than the control group, without a negative impact on accuracy, we will take steps to expand the feature.
  • If our risks are triggered, we will implement our contingency plan.
  • If users who see the treatment do not select a suggestion more than 50% of the time after viewing the suggestions, we will not expand the feature.
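
Several of the criteria above are simple threshold checks. As a sketch only, assuming hypothetical metric names (the actual analysis pipeline is not specified here), they could be expressed as:

  data class ExperimentResults(
      val suggestionAccuracy: Double,     // from manual patrolling, 0.0 to 1.0
      val manualAccuracy: Double,         // accuracy of manually written descriptions
      val avgTaskTimeTreatment: Double,   // seconds
      val avgTaskTimeControl: Double,     // seconds
      val suggestionSelectionRate: Double // share of treatment users who select a suggestion
  )

  // Keep the feature only if suggested edits are at least 80% accurate and
  // no less accurate than manually written edits.
  fun keepFeature(r: ExperimentResults): Boolean =
      r.suggestionAccuracy >= 0.80 && r.suggestionAccuracy >= r.manualAccuracy

  // Expand the feature only if it is worth keeping and treatment users select
  // a suggestion more than half the time.
  fun expandFeature(r: ExperimentResults): Boolean =
      keepFeature(r) && r.suggestionSelectionRate > 0.50

  // Doubled task completion time warrants a check against performance reports.
  fun investigatePerformance(r: ExperimentResults): Boolean =
      r.avgTaskTimeTreatment >= 2 * r.avgTaskTimeControl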

In aggregate, there should be at least 1,500 people (with a stretch goal of 2,000 people) and 4,000 edits included in the A/B test across the following mBART-25 wikis: English, Russian, Vietnamese, Japanese, German, Romanian, French, Finnish, Korean, Spanish, Chinese (Simplified), Italian, Dutch, Arabic, Turkish, Hindi, Czech, Lithuanian, Latvian, Kazakh, Estonian, Nepali, Sinhala, Gujarati, and Burmese.

Risk management

Any time machine learning is used, we introduce more risk than is already involved in software development. For that reason, we are tracking and managing the risks associated with this project alongside our Security and Legal teams.

Risk: Algorithm defames living people
  • Cause: The algorithm pulls controversial aspects of a living person and includes them in the description.
  • Level: Low
  • Response: Mitigate
  • Response Action: We will monitor the output of what gets published and see what is reported in order to make adjustments to the learning model. In testing we haven't seen a case of this; quite the opposite, we see cases of the algorithm whitewashing history. As an extra precaution, we will only allow experienced editors to edit biographies of living people.
  • Trigger: Defamation detected during patrolling
  • Contingency Plan: Remove suggestions on BLPs completely

Risk: Overwhelm patrollers
  • Cause: The new feature increases interest in the task type while the algorithm does not increase the quality of edits.
  • Level: Medium
  • Response: Mitigate
  • Response Action: We will have a dedicated team of people patrolling the edits from this feature so as not to overwhelm volunteer patrollers, and will give advance notice to the Wikidata and English Wikipedia communities.
  • Trigger: Staff unable to keep up with patrolling demands
  • Contingency Plan: Restrict the number of tasks with suggestions in a day

Risk: Proposes NSFW content
  • Cause: There is NSFW content in the article that is suggested for the description.
  • Level: Low
  • Response: Mitigate
  • Response Action: The algorithm pulls primarily from the first paragraph. We have a reporting mechanism and will be patrolling edits.
  • Trigger: 2% or more of users report a problem
  • Contingency Plan: We will hardcode blocked words based on the abuse filter

Risk: Users abandon the task due to performance issues
  • Cause: The model is on a temporary host, and generating more than one option can take a while.
  • Level: Medium
  • Response: Mitigate
  • Response Action: Load answers in the background before users click the button to show suggestions.
  • Trigger: 4 of 10 users express performance issues during usability testing
  • Contingency Plan: Show one option or make other changes to the UI

Risk: Misgendering or ethnic hallucinations
  • Cause: The algorithm incorrectly genders people or provides an incorrect ethnicity.
  • Level: Medium
  • Response: Mitigate
  • Response Action: During the experiment this is something we will deliberately look for in patrolling, and we will monitor reports.
  • Trigger: Reported more than 2% of the time
  • Contingency Plan: We will pause the feature, hard-code reminders, and decrease to one suggestion
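
Two of the rows above use a "2% of users report a problem" style trigger. A trivial check of that form, with placeholder names, might look like:

  // Returns true when the report rate crosses the 2% contingency threshold.
  fun contingencyTriggered(reportCount: Int, exposedUserCount: Int): Boolean =
      exposedUserCount > 0 && reportCount.toDouble() / exposedUserCount >= 0.02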


How to follow along

We have created T316375 as our Phabricator Epic to track this work. We encourage your collaboration there or on our Talk Page.

There will also be periodic updates to this page as we make progress. You can also test the model at https://ml-article-descriptions.toolforge.org/. Please keep in mind that a number of client-side filters are being added to improve the quality of the model's output; those safeguards are described in the Risk management section of this page.
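
For illustration, a client-side blocklist filter of the kind mentioned above could be as simple as the following sketch; the word list and function name are placeholders, not the actual safeguards shipped in the app:

  // Illustrative blocklist; a real list would be derived from abuse filter data.
  val blockedWords = setOf("placeholderWord1", "placeholderWord2")

  // Rejects a suggestion if it contains any blocked word.
  fun passesClientSideFilter(suggestion: String): Boolean {
      val normalized = suggestion.lowercase()
      return blockedWords.none { normalized.contains(it) }
  }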