Wikimedia Research/Showcase/Archive/2022/07

July 2022

Time: 9:30am PDT / 12:30pm EDT / 18:30 CEST
Theme: 2022 Wikimedia Foundation Research of the Year Award Winners!

July 20, 2022 Video: YouTube

Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

By Krishna Srinivasan (Google)

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR, and vision tasks. Multimodal modeling techniques aim to leverage large, high-quality visio-linguistic datasets for learning complementary information across image and text modalities. In this talk, I introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.

WIT’s unique advantages include: (1) WIT is the largest multimodal dataset by number of image-text examples, by a factor of 3 (at the time of writing); (2) WIT is massively multilingual (the first of its kind), covering more than 100 languages; and (3) WIT represents a more diverse set of concepts and real-world entities than previous datasets cover.

The WIT dataset is available for download and use under a Creative Commons license at https://github.com/google-research-datasets/wit.

I conclude the talk with future directions to expand and extend the WIT dataset.

Link to paper: https://arxiv.org/pdf/2103.01913.pdf

Assessing the Quality of Sources in Wikidata Across Languages

By Gabriel Amaral (King's College London)

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata’s ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.

Link to paper: https://dl.acm.org/doi/abs/10.1145/3484828

Link to slides: https://figshare.com/articles/presentation/Wikimedia_Research_Showcase_Assessing_the_quality_of_sources_in_Wikidata_across_languages/20384322