Talk:Wikimedia Apps/Team/Android/Machine Assisted Article Descriptions

Results are posted

Hello @Certes @MichaelMaggs @MarioGom @Kusma @Fram @Jonesey95 @Thryduulf @Johnbod @Sm8900,

My name is Jaz Tanner and I am a Product Manager on the Mobile Apps Team. I am reaching out because at one point you shared feedback about Machine Assistance for Article Descriptions. By the time most of you provided feedback, the experiment had already been completed, and we had already received support from some of your fellow experienced editors of English Wikipedia who participated in the experiment and patrolled the edits that came from the feature. I do want to highlight that the Toolforge model was a rough example of the type of edits that came through the app. On the client side (through the app), many more quality controls and filters were put in place before suggestions reached the end user.

It did unfortunately take longer than expected to post results and migrate the model to a more appropriate host site (LiftWing). Although the model has been migrated, the feature is not turned on; once the experiment was complete, the feature was removed from the app. The next phase of this project is to invite members of different language communities to review the results on our project page and determine whether you believe the tool is right for your community. We welcome feedback on what improvements to the model people might like to see, beyond what we've suggested in the results. We plan to officially start outreach next month, but I wanted to proactively tag you all so that you could start reviewing the results and next steps in advance.

Thanks as always for your feedback, patience, time and energy. It's genuinely much appreciated! JTanner (WMF) (talk) 21:36, 15 August 2024 (UTC)

What is the "beam" or "Beam" mentioned in the results? I am unfamiliar with that term. Also, since it has been so long since I provided feedback on this project, can you please link to the feedback that I provided, so that I can be reminded of my suggestions or concerns? I do not see them on this talk page. Thanks. Jonesey95 (talk) 22:39, 15 August 2024 (UTC)
Hi @Jonesey95,
A beam in LLMs represents a single sequence (or partial sequence) of words generated by the model during the text generation process. Beam search is an algorithm that maintains multiple beams (possible sequences) at once, expanding them step by step to generate the best possible output. Within the context of this project, we showed two results using two different article description variants. Our findings tell us that beam one yielded more accurate results than beam two. You can learn more about beam searches here. Your feedback can be found here. JTanner (WMF) (talk) 00:30, 16 August 2024 (UTC)
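To make the term concrete, here is a minimal sketch of how beam search works (illustrative only; next_token_scores is a hypothetical stand-in for the model's per-token log-probabilities, and this is not the project's actual decoding code):

    def beam_search(next_token_scores, beam_width=2, max_len=8, eos="</s>"):
        # Each beam is a (sequence, cumulative log-probability) pair.
        beams = [([], 0.0)]
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:
                    candidates.append((seq, score))  # finished beams carry over unchanged
                    continue
                for token, logp in next_token_scores(seq):
                    candidates.append((seq + [token], score + logp))
            # Keep only the beam_width highest-scoring partial sequences.
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        return beams  # beams[0] is "beam one", the model's top-ranked description

In the project's terminology, the highest-scoring surviving sequence is "beam one" and the runner-up is "beam two".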
Thanks. I guess I could have looked higher in the page to learn this "beam" jargon. If you're writing for Wikipedia editors, please avoid unfamiliar jargon. As for the results, do they reflect the tool that we tested and provided feedback on in August 2023, or was the tool fixed after it was shown to be failing completely? If the tool was fixed, I must have missed the call to test it again. If the tool was not fixed, then describing it as ready for deployment under the "Recommendation of if feature should be enabled" column, as is done in the January 2024 results, is a drastic misinterpretation of the tool's output. I just went to https://ml-article-descriptions.toolforge.org/ again and clicked for a random article. It gave me three descriptions for en:Second Tătărăscu cabinet: "Cabinet of the Romanian government" (barely acceptable; missing years); "Cabinet of Romania in the second cabinet" (nonsensical); and "Romanian politician" (factually incorrect). I must be missing something. Jonesey95 (talk) 00:56, 16 August 2024 (UTC)
Thanks for this feedback, @Jonesey95. Can you recommend phrasing that would have been more helpful? "Beam" is the literal technical term, but hearing an alternative would support me in reframing it in the future.
Yes, I think perhaps the misunderstanding is that the Toolforge version is a raw, unfiltered version of the model. On the project page we mention, "Please keep in mind, there are a bunch of filters that are being added client side to improve the quality of the model. Those safeguards can be read in the Risk Management portion of this page." Also, in the next steps section, we provide a description of quality guardrails that were in place when the experiment ran, and additional ones that could be added after.
The Toolforge model gives a rough idea of how the model works, but it isn't entirely true to what users saw, because it is a version without quality controls. The version users saw in the app included many filters and quality controls. If you'd like to know what the model produces with the quality measures in place, I can work with our engineers to spin up a test version of the app that you can download if you have access to an Android device; however, this may take a few weeks. Alternatively, you can sign the same Non-Disclosure Agreement the graders did to see the output based on what other editors published using the tool.
However, if you're OK with a sample, I picked a few edits that came from editors using the tool during the experiment period:
The last article is one you've edited as well. In the version history of the article, you can see whether the article description was added with a machine suggestion: the edit summary will say #machine-suggestion. #machine-suggestion-modified means the editor used the suggestion as a starting place but modified it in some way.
You can compare the article descriptions of these 5 articles, which were created through the app, to the Toolforge model. You'll see, for example, that the Active Member article shows an error message through Toolforge but worked through the app.
Additionally, I am not surprised that, when seeing 3 beams, two of them have inaccuracies. It aligns with our data: editors chose the first beam most of the time, and it had higher accuracy. That is also why we will only show the first beam, in conjunction with the aforementioned quality filters, should different language communities request the feature be reinstated on their wiki. If you haven't had an opportunity to see how the beams (choices) were presented in the app, it might be helpful to review the designs. JTanner (WMF) (talk) 03:33, 16 August 2024 (UTC)
Instead of "beam", use "short description" (the English Wikipedia term) or "description" (the Wikidata term). There is no need for a new jargon term when addressing external audiences, i.e. WP editors.
Thanks for the links. I have fixed them to show the relevant diffs. The first two SD results were reasonable. The "Free response question" SD was subpar and did not follow en.WP's guidance, but I have seen many human editors make the same error. The "Active Member" SD was wrong; the word "band" is inaccurate and does not appear in the article. The "833 cents scale" SD was wrong; it left out a necessary word, changing the meaning. So that's three out of five that needed to be fixed. I am dubious about a tool of this quality being rolled out as if it were a good-quality tool. At a larger scale, I am much more dubious about the WMF's development priorities: there are a great many long-standing bugs and feature requests, the resolution of which would significantly improve the editing experience. I just don't understand the amount of effort put into new tools that do a poor job of what they set out to do when there is so much basic improvement to be done. Jonesey95 (talk) 17:17, 16 August 2024 (UTC)
Thanks for this feedback; I do appreciate that you are engaging with me here, @Jonesey95. OK, I will continue to evolve the explanation of beams in writing. It is a short description, but I do want to make sure I am conveying to people that the type of short description they see differs depending on the algorithm used. I am also passing this on to the researchers who built the model, where it lists "Beam". Free Response and Active Member were modified by a human; I will have to go back, see what the original format was, and determine whether not following the format was human error or the model's. Thanks for the feedback about 833 cents scale; I will show it to the researchers so they can improve the model.
Not sure if you had an opportunity to take part in the Annual Planning process, but I highly encourage you to engage there so we can get your feedback early about what the Foundation builds for the year. This was built more than a year ago, at the time with the hope that it would improve the quality of article short descriptions, which was a request from some members of the community. One way we tackled this was by honoring a request you made in 2022: we created an edit gate so that brand-new users aren't able to easily add short descriptions using our tools before the feature is revealed. Hearing your feedback is genuinely helpful, and letting researchers know ways they can improve the model is something I intend to do. Until there is consensus on a wiki that the model is good enough that it does more good than harm, we won't reintegrate it back into the app. In the meantime we will keep working on other things that folks request of us, in alignment with the Annual Plan. JTanner (WMF) (talk) 17:50, 16 August 2024 (UTC)
I'm afraid I find all this rather confusing, largely because it is unclear exactly what has been tested, and by whom. As I recall, you asked for feedback on a generic (multiple language Wikipedias) Toolforge model, without filters, which was pretty uniformly negative. The statistics you now present on the project page show that the machine suggestion was rejected 62% of the time, which really doesn't sound encouraging. If I read your comments correctly, you're suggesting that those results aren't unexpected, and will be improved by the addition of some filters or "quality guardrails". But I can't tell whether you have already sought user feedback on, and carried out an analysis of, the final results after all necessary filtering.
You mentioned that you "received support from some of your fellow experienced editors of English Wikipedia". It's possible I missed a posting, but I don't recall any request for volunteers to review filtered output on the English Wikipedia's WikiProject_Short_descriptions page, which is where all the editors experienced in short descriptions can be found. If not already done, that really is essential.
Could you explain where in your workflow you are making the necessary distinctions between the rules of the English Wikipedia and of the other Wikipedias? While I generally don't like English Wikipedia exceptionalism, I do find it worrying that your main project page still completely ignores very well established enWiki guidance, even incorrectly stating that what you call 'article descriptions' are stored on Wikidata, and that the 60-character text "Spiral galaxy in the Local Group containing the Solar System" would be suitable as a short description. It's absolutely not. I'm sorry to note that the feedback I provided a year ago has, so far as I can tell, not received any response, nor has it influenced what you are proposing to do on enWiki one jot. Let me repeat it here, for convenience, as it's just as relevant now as it was then:
Hi @ARamadan-WMF, as I'm sure you know the English Wikipedia has some very well developed guidance at Wikipedia:Short description for writing short descriptions, which enjoys wide community support. Could you please indicate how your project tries to ensure compliance with those rules? Do you, for example, run a post-processing step to ensure that you meet WP:SDFORMAT and WP:SDDATES standards, and that you use WP:SDNONE appropriately?
You will want to change your nomenclature, as well, as the expression "Article description" means nothing here. The term consistently used for the last six years, ever since these things were invented, is "Short description". Use of the word "short" is of real practical importance, as new users often tend to assume that descriptions of 60 or even 80 characters are OK. The fact that your system easily generates impossible suggestions of over 90 characters (e.g. "Hunger") tells me straight away that you haven't considered WP:SDSHORT.
You seem to be running a multi-language project that doesn't so far as I can tell take any account of the rules and customs here. I suggest you might like to seek advice from Wikipedia:WikiProject Short descriptions, which is pretty active. (MichaelMaggs (talk) 10:39, 18 August 2023 (UTC))
Please don't continue to side-step these issues, as they will affect whether you can get EnW community approval to implement this. On the English Wikipedia, per WP:SDCONTENT, "The short description is part of the article content, and is subject to the normal rules on content." Like all other content in the encyclopedia, the content of each short description is subject to community rules and community approval, and unless your filters can implement such things as WP:SDSHORT, WP:SDDATES and WP:SDNONE, you won't get that community approval.
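As a rough illustration of the kind of post-processing check being asked for here (a hypothetical Python sketch, not anything the team has published; the actual guidance has far more nuance than these few rules):

    def passes_enwiki_short_description_rules(text):
        s = text.strip()
        if not s or s.lower() == "none":
            # WP:SDNONE: "none" is a deliberate editorial choice, not a model fallback
            return False
        if len(s) > 40:                   # WP:SDSHORT: aim for about 40 characters or fewer
            return False
        if s.endswith("."):               # WP:SDFORMAT: no full stop at the end
            return False
        if s.split()[0].lower() in ("a", "an", "the"):
            # WP:SDFORMAT: avoid starting with an initial article
            return False
        return True

A suggestion failing any of these checks would simply never be shown to the editor.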
Finally, let me ask a more general question. Is this project useful for the English Wikipedia at all? It may well be useful for other less established Wikipedias, but on enW over 86% of articles already have a short description thanks to a combination of manual edits, approved bots, and wording automatically generated from infoboxes. Nearly 7 million articles have been dealt with over the last four years, and at the current rate of progress we expect to complete the final 900,000 by 2026. After that, new descriptions will be required only on new articles, and we have more than enough editors to manually keep up with such a workload, at a far higher quality than will in the foreseeable future be possible with AI. If a new editor can't write a decent short description, it's much better that the field is left blank rather than written poorly. New articles with missing short descriptions can be automatically added to a checking category, ready for someone to quickly fix. Poorly written AI short descriptions are much harder to identify and manually correct. MichaelMaggs (talk) 18:11, 16 August 2024 (UTC)
Thanks for this notice. I have to agree with the comments above, which question any need for AI to assist with articles. Sm8900 (talk) 20:32, 16 August 2024 (UTC)
Hi @MichaelMaggs, I think the current state of the project page, which hasn't been through an overall update for a while, is creating some confusion about the state of the project. In a nutshell, the model was removed and the experiment was deactivated in May 2023. From April to May 2023 it was tested in the Android app, and from May 2023 to July 2023 it was graded by volunteers. I will better delineate what quality controls were in place during the time of the experiment and further expand the next steps section on what controls can be added, with additional examples.

I will also expand parts of the report, such as what rejection means. Rejection only means someone chose to write a description when a suggestion was present (not necessarily visible). This could look like someone typing out the description without opening the entry point to see what the suggestion was; this may be easier to visualize by looking at the designs.

Overall, the purpose of this update was simply to share the results of the 2023 experiment and to let communities know that if they'd like to adopt the tool, they can; there are no plans or intentions to deploy it anywhere if communities aren't interested. Thanks in advance for your patience as I fill in any missing gaps. @ARamadan-WMF will have more information soon about our plans to share the research results more broadly. JTanner (WMF) (talk) 21:28, 16 August 2024 (UTC)
Hi JTanner (WMF), thanks for replying. Did you have comments on any of the substantive issues I raised, please? MichaelMaggs (talk) 22:33, 16 August 2024 (UTC)
Hopefully I'm not missing something obvious here amid what is a rather technical undertaking—though I understand the technicalities themselves well enough—but after reading this I also admit I do not quite understand the underlying motivation for this specific experiment. If I had to guess, article descriptions (using the general term?) have been identified as a particularly viable area for LLM generation due to both their brevity and their semantic "shallowness"—is that fair?—it doesn't seem like science fiction reasoning faculties would be required of an AI supplying them for articles. However, that very shallowness means that they are very easy for editors to write themselves, more than any other part of an article really. I am not clear on either the utility of a theoretical production-ready edition of this tool, or the applicability of this task to further research. Article descriptions seem so semantically shallow that techniques for writing them would be insufficient to adapt for any other task. Hopefully that all makes sense; I hope my attempt to engage is well-received. Remsense (talk) 00:02, 17 August 2024 (UTC)