Jump to content

Wikimedia Research/Showcase/Archive/2024/07

From mediawiki.org

July 2024

[edit]
Time
Wednesday, July 24, 16:30 UTC: Find your local time here
Theme
Machine Translation on Wikipedia

July 24, 2024 Video: YouTube

The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide
By Kai Zhu, Bocconi University
Machine translation technologies have the potential to bridge knowledge gaps across languages, promoting more inclusive access to information regardless of native languages. This study examines the impact of integrating Google Translate into Wikipedia's Content Translation system in January 2019. Employing a natural experiment design and difference-in-differences strategy, we analyze how this translation technology shock influenced the dynamics of content production and accessibility on Wikipedia across over a hundred languages. We find that this technology integration leads to a 149% increase in content production through translation, driven by existing editors becoming more productive as well as an expansion of the editor base. Moreover, we observe that machine translation enhances the propagation of biographical and geographical information, helping to close these knowledge gaps in the multilingual context. However, our findings also underscore the need for continued efforts to mitigate the preexisting systemic barriers. Our study contributes to our knowledge on the evolving role of artificial intelligence in shaping knowledge dissemination through enhanced language translation capabilities.


Implications of Using Inorganic Content in Arabic Wikipedia Editions
By Saied Alshahrani and Jeanna Matthews, Clarkson University
Wikipedia articles (content pages) are one of the widely utilized training corpora for NLP tasks and systems, yet these articles are not always created, generated, or even edited organically by native speakers; some are automatically created, generated, or translated using Wikipedia bots or off-the-shelf translation tools like Google Translate without human revision or supervision. We first analyzed the three Arabic Wikipedia editions, Arabic (AR), Egyptian Arabic (ARZ), and Moroccan Arabic (ARY), and found that these Arabic Wikipedia editions suffer from a few serious issues, like large-scale automatic creations and translations from English to Arabic, all without human involvement, generating content (articles) that lack not only linguistic richness and diversity but also content that lacks cultural richness and meaningful representation of the Arabic language and its native speakers. We second studied the performance implications of using such inorganic, unrepresentative articles to train NLP tasks or systems, where we intrinsically evaluated the performance of two main NLP upstream tasks, namely word representation and language modeling, using word analogy and fill-mask evaluations. We found that most of the models trained on the organic and representative content outperformed or, at worst, performed on par with the models trained with inorganic content generated using bots or translated using templates included, demonstrating that training on unrepresentative content not only impacts the representation of native speakers but also impacts the performance of NLP tasks or systems. We recommend avoiding utilizing the automatically created, generated, or translated articles on Wikipedia when the task is a representation-based task, like measuring opinions, sentiments, or perspectives of native speakers, and also suggest that when registered users employ automated creation or translation, their contributions should be marked differently than “registered user” for better transparency; perhaps “registered user (automation-assisted)”.