Content translation/Published translations

From mediawiki.org

Information about published translations is generally helpful to machine translation developers and others, for purposes such as terminology extraction and cross-linguistic research. Content Translation aims to provide data about translations under an open license. The amount and detail of the data will improve over time. This page captures the current state.

List of published source and target titles

Content Translation has an API to get a list of all published translations across languages.

  • List of all published translations across all languages. Example
  • List of all published translations between two languages. Example
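
The two queries listed above can be sketched as URL builders. This is a minimal sketch, not the definitive request format: the endpoint is the one used by the other examples on this page, the list name comes from the cxpublishedtranslations API named further down, and the `from` and `to` parameter names are assumptions that should be checked against the API help before use. The function name `published_translations_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the other API examples on this page.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def published_translations_url(source_lang=None, target_lang=None):
    """Build a query URL for the list of published translations.

    The "from" and "to" parameter names are assumptions; verify them
    against the API help page before relying on them.
    """
    params = {
        "action": "query",
        "list": "cxpublishedtranslations",
        "format": "json",  # standard MediaWiki API output format parameter
    }
    if source_lang:
        params["from"] = source_lang  # assumed parameter name
    if target_lang:
        params["to"] = target_lang  # assumed parameter name
    return API_ENDPOINT + "?" + urlencode(params)


all_translations = published_translations_url()
english_to_spanish = published_translations_url("en", "es")
```

Calling the function without arguments corresponds to the first bullet (all languages); passing both language codes corresponds to the second.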

Currently, the API returns the following details for each translation (illustrated with an example).

{
  "translationId": "510",
  "sourceTitle": "Tequendama Falls Museum",
  "targetTitle": "Casa Museo Salto de Tequendama Biodiversidad y Cultura",
  "sourceLanguage": "en",
  "targetLanguage": "es",
  "sourceURL": "//en.wikipedia.org/wiki/Tequendama Falls Museum",
  "targetURL": "//es.wikipedia.org/wiki/Casa Museo Salto de Tequendama Biodiversidad y Cultura",
  "publishedDate": "20151006230043",
  "sourceRevisionId": "35676",
  "targetRevisionId": "7689875",
  "stats": {
      "any": 0.93459552495697,
      "human": 0.67469879518072,
      "mt": 0.25989672977625,
      "mtSectionsCount": 2
  }
}

The stats data shows how much of the source article was translated. As the example shows, the values are fractions between 0 and 1, and they are calculated at section level. human indicates the manually translated share and mt the machine-translated share. Any edit to machine translation output counts as a manual edit. any indicates the total translation (any = human + mt); in the example above, 0.6747 + 0.2599 = 0.9346. Content Translation does not demand full translation of the source article: users can freely translate as many or as few sections as they want. mtSectionsCount shows the total number of translated sections. These stats are also used for abuse prevention (read more about the percentage calculation on that page).
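
The relation any = human + mt can be checked directly on the example record. The following is a small sketch that uses only the stats values shown above; the helper names `stats_are_consistent` and `as_percent` are hypothetical and not part of any API.

```python
# Stats block copied from the example record above.
stats = {
    "any": 0.93459552495697,
    "human": 0.67469879518072,
    "mt": 0.25989672977625,
    "mtSectionsCount": 2,
}


def stats_are_consistent(stats, tolerance=1e-9):
    """Return True if any equals human + mt, up to floating-point tolerance."""
    return abs(stats["any"] - (stats["human"] + stats["mt"])) <= tolerance


def as_percent(fraction):
    """Format a 0..1 fraction as a percentage with one decimal place."""
    return f"{fraction:.1%}"


consistent = stats_are_consistent(stats)
summary = {key: as_percent(stats[key]) for key in ("any", "human", "mt")}
```

A tolerance is used instead of an exact comparison because the values are floating-point numbers.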

Parallel corpora

Along with the new articles created using translations, the source and translated articles are good sources of parallel text. Content Translation collects these texts and makes them available to everyone. Machine translation developers can use them to train their systems. Content Translation also captures the alignment of sections between source and translation, in some cases down to sentence granularity, using HTML markup in the sections. Content Translation does not perform any automatic alignment; the provided alignment is best effort, based on how the connections were preserved while the translation was being made. When aligning sentences automatically, remember that translations do not necessarily match 1:1.

API

A separate API gives access to the parallel text of a single translation. First, you need the translation ID, which can be obtained from the cxpublishedtranslations API explained above. To get the section-level aligned parallel text, use the contenttranslationcorpora API.

Example: https://en.wikipedia.org/w/api.php?action=query&list=contenttranslationcorpora&translationid=108992

The output is JSON and contains section-level contents. A section is a paragraph, header or figure; technically, a block-level element in HTML. Every section contains up to three versions:

  1. source: The source content.
  2. mt: The machine-translated content. If the language pair has a machine translation service and the translator used it, this holds the unmodified machine translation of the source section. It is empty if machine translation was not used.
  3. user: The final translation by the user. This is either the machine translation improved by manual edits, or a translation from scratch if no MT was used.

By default, the section contents are HTML. If you prefer a plain-text version of each section, use the striphtml argument of the API.

Example: https://en.wikipedia.org/w/api.php?action=query&list=contenttranslationcorpora&translationid=108992&striphtml=true

If you want only the source and user versions, use the types argument. Its default value is source|mt|user.

Example: https://en.wikipedia.org/w/api.php?action=query&list=contenttranslationcorpora&translationid=108992&striphtml=true&types=source%7Cuser
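
The three example URLs above differ only in their optional arguments, so they can be generated from one helper. This is a sketch under the assumption that the parameters behave as described on this page (translationid, striphtml, types); the function name `corpora_url` and its keyword arguments are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs above.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def corpora_url(translation_id, strip_html=False, types=None):
    """Build a contenttranslationcorpora query URL for one translation.

    strip_html adds striphtml=true; types is an optional list such as
    ["source", "user"] that is joined with "|" as the API expects.
    """
    params = {
        "action": "query",
        "list": "contenttranslationcorpora",
        "translationid": translation_id,
    }
    if strip_html:
        params["striphtml"] = "true"
    if types:
        params["types"] = "|".join(types)  # "|" is percent-encoded as %7C
    return API_ENDPOINT + "?" + urlencode(params)


html_url = corpora_url(108992)
plain_text_url = corpora_url(108992, strip_html=True)
source_and_user_url = corpora_url(108992, strip_html=True, types=["source", "user"])
```

The three resulting strings match the three example URLs in this section, including the %7C encoding of the separator in the types value.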

Note: The output of this API is empty for old translations (before 2016-01-22), because the API and the required infrastructure were introduced only on that date and the parallel text of earlier translations was not captured. If you have a good aligner, you can still use the real article pairs from Wikipedia, located through the output of the cxpublishedtranslations API.

Dumps

Because accessing translations one by one is inconvenient, translation dumps are provided in tmx and json formats. They can be downloaded here. Large language pairs get separate dump files. For example, cx-corpora.ca2es.text.tmx.gz is the parallel corpora dump file for Catalan to Spanish translations. For smaller language pairs, a single file covering all of them is provided, named cx-corpora._2_.text.tmx.gz.
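
A gzip-compressed tmx file can be read with the Python standard library alone. The sketch below assumes the dumps follow the generic TMX layout (tu elements holding tuv elements with an xml:lang attribute and a seg child); the exact structure of the real dump files is not described on this page and may carry additional elements. The sample data is invented for illustration and is not taken from a dump, and `iter_pairs` is a hypothetical helper.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented minimal TMX sample (NOT real dump content).
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="ca"/>
  <body>
    <tu>
      <tuv xml:lang="ca"><seg>Gat</seg></tuv>
      <tuv xml:lang="es"><seg>Gato</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def iter_pairs(stream, source_lang, target_lang):
    """Yield (source, target) segment pairs from a TMX byte stream."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            yield segments[source_lang], segments[target_lang]
        elem.clear()  # release the processed unit to keep memory use low


# Stand-in for a downloaded file such as cx-corpora.ca2es.text.tmx.gz.
compressed = io.BytesIO(gzip.compress(SAMPLE_TMX))
with gzip.open(compressed, "rb") as stream:
    pairs = list(iter_pairs(stream, "ca", "es"))
```

For a downloaded dump, `gzip.open` can be given the file path instead of the in-memory buffer. Incremental parsing is used because the dump files for large language pairs can be big.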

Note that dumps (unlike API data) only include translations where editors have made some changes (i.e., the "human" value is greater than 0). The reason is that the dumps focus on capturing the corrections users made to machine translation output, which can be used to improve machine translation systems.
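
The same selection rule can be applied to records obtained from the API, for example to predict which translations appear in the dumps. This is a sketch that assumes each record carries a stats block shaped like the example near the top of this page; the sample records and the helper name `included_in_dumps` are hypothetical.

```python
def included_in_dumps(record):
    """Return True if the record has a "human" share greater than 0."""
    return record.get("stats", {}).get("human", 0) > 0


# Invented sample records, shaped like the API example on this page.
records = [
    {"translationId": "1", "stats": {"any": 0.9, "human": 0.6, "mt": 0.3}},
    {"translationId": "2", "stats": {"any": 0.4, "human": 0, "mt": 0.4}},
]

edited_ids = [r["translationId"] for r in records if included_in_dumps(r)]
```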

External repositories

Data about published translations has been integrated into the OPUS project. The goal of the OPUS project is to create an open parallel corpus by converting, aligning and annotating free online data.

Analysis Examples

For an example of how to collect translation IDs from the cxpublishedtranslations API and link them to the parallel corpora, see: https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/content-translation-basics.ipynb