Content translation/Published translations

From mediawiki.org

Information about published translations is useful to machine translation developers and others for purposes such as terminology extraction and cross-lingual research. Content Translation aims to provide data about translations under an open license. The amount and detail of the data will improve over time. This page describes the current state.

List of published source and target titles

Content Translation has an API for retrieving the list of all published translations across languages.

  • List of all published translations in all languages. Example
  • List of all published translations between two languages. Example
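As a rough sketch, the two requests listed above could be built like this. The list name cxpublishedtranslations comes from this page; the endpoint host and the `from` and `to` parameter names are assumptions and should be checked against the live API help before use.

```python
from urllib.parse import urlencode

# Assumption: en.wikipedia.org is used only as an illustrative host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def published_translations_url(source=None, target=None):
    """Build a query URL for the cxpublishedtranslations list.

    The `from` and `to` parameter names are assumptions. With neither set,
    the request is not restricted to a language pair.
    """
    params = {
        "action": "query",
        "list": "cxpublishedtranslations",
        "format": "json",
    }
    if source:
        params["from"] = source
    if target:
        params["to"] = target
    return API_ENDPOINT + "?" + urlencode(params)


all_pairs_url = published_translations_url()
en_to_es_url = published_translations_url("en", "es")
print(all_pairs_url)
print(en_to_es_url)
```

The function only builds URLs, so it can be tried without network access; fetching and paging through results is left out of this sketch.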

Currently, the API output returns the following details (illustrated with an example).

{
  "translationId": "510",
  "sourceTitle": "Tequendama Falls Museum",
  "targetTitle": "Casa Museo Salto de Tequendama Biodiversidad y Cultura",
  "sourceLanguage": "en",
  "targetLanguage": "es",
  "sourceURL": "//en.wikipedia.org/wiki/Tequendama Falls Museum",
  "targetURL": "//es.wikipedia.org/wiki/Casa Museo Salto de Tequendama Biodiversidad y Cultura",
  "publishedDate": "20151006230043",
  "sourceRevisionId": "35676",
  "targetRevisionId": "7689875",
  "stats": {
    "any": 0.93459552495697,
    "human": 0.67469879518072,
    "mt": 0.25989672977625,
    "mtSectionsCount": 2
  }
}

The stats data describes how much of the article was translated. The values are expressed as fractions rather than percentages (0.93 means roughly 93%). human is the share translated manually and mt is the share left as machine translation; any edit to machine translation output counts as a manual edit. any is the total translation (any = human + mt); in the example, 0.67469879518072 + 0.25989672977625 = 0.93459552495697. The values are calculated at section level. Content Translation does not require a full translation of the source article: users can translate as many or as few sections as they want. mtSectionsCount shows the total number of translated sections. These stats are also used for abuse prevention; the page on that topic explains the calculation in more detail.
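The fields of the example record can be read as in the sketch below. It assumes that publishedDate is a 14-digit YYYYMMDDHHMMSS timestamp and that the protocol-relative URLs can be completed with https; neither is stated on this page.

```python
import math
from datetime import datetime

# The example record from this page, reduced to the fields used below.
record = {
    "sourceURL": "//en.wikipedia.org/wiki/Tequendama Falls Museum",
    "publishedDate": "20151006230043",
    "stats": {
        "any": 0.93459552495697,
        "human": 0.67469879518072,
        "mt": 0.25989672977625,
        "mtSectionsCount": 2,
    },
}


def summarize(rec):
    """Return the publication time, an absolute source URL, and percentages."""
    stats = rec["stats"]
    return {
        # Assumption: publishedDate is a YYYYMMDDHHMMSS timestamp.
        "published": datetime.strptime(rec["publishedDate"], "%Y%m%d%H%M%S"),
        # Assumption: https is an acceptable scheme for the relative URL.
        "source_url": "https:" + rec["sourceURL"],
        "human_pct": round(stats["human"] * 100, 1),
        "mt_pct": round(stats["mt"] * 100, 1),
        "any_pct": round(stats["any"] * 100, 1),
        # Check that any = human + mt, allowing for float rounding.
        "consistent": math.isclose(
            stats["any"], stats["human"] + stats["mt"], abs_tol=1e-9
        ),
    }


summary = summarize(record)
print(summary)
```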

Parallel corpora

Besides producing new articles, translations yield pairs of source and translated articles that are good sources of parallel text. Content Translation collects these and makes them available to everyone. Machine translation developers can use them to train their systems. Content Translation also captures the alignment between sections of the source and the translation, and in some cases alignment at sentence granularity, using HTML markup in the sections. Content Translation does not perform any automatic alignment; the alignment provided is best effort, based on how well the connections were preserved during translation. When aligning sentences automatically, keep in mind that translations do not necessarily match 1:1.

API

A separate API gives access to the parallel text of a single translation. You first need the translation ID, which can be obtained from the cxpublishedtranslations API described above. To get the section-level aligned parallel text, use the contenttranslationcorpora API.

The output is JSON and contains section-level content. A section is a paragraph, a header, or a figure; technically, it is a block-level element in HTML. Every section contains up to three versions:

  1. source: The source content.
  2. mt: The machine-translated content. If the language pair has a machine translation service and the translator used it, this holds the unmodified machine translation of the source section. It is empty if machine translation was not used.
  3. user: The final translation by the user. This is either the machine translation improved by manual edits, or a translation from scratch if no MT was used.
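The three versions make it possible to tell how each section was produced. The sketch below uses a hypothetical, simplified section shape (a dict with plain-string "source", "mt", and "user" values) and invented sample sentences; the real API response is structured differently and would need to be mapped onto this shape first.

```python
def classify_section(section):
    """Label a section by how its final translation was produced.

    `section` is a hypothetical, simplified shape: a dict whose optional
    "source", "mt", and "user" keys hold strings (missing or empty if absent).
    """
    mt = section.get("mt") or ""
    user = section.get("user") or ""
    if not user:
        return "no_user_version"
    if not mt:
        return "from_scratch"      # no machine translation was used
    if mt.strip() == user.strip():
        return "unmodified_mt"     # MT output published as-is
    return "post_edited"           # MT output corrected by the user


# Invented sample data, for illustration only.
sections = [
    {"source": "The river flows north.",
     "mt": "El río fluye hacia el norte.",
     "user": "El río corre hacia el norte."},
    {"source": "History", "mt": "", "user": "Historia"},
    {"source": "See also", "mt": "Ver también", "user": "Ver también"},
]

labels = [classify_section(s) for s in sections]
print(labels)
```

Sections labeled post_edited are the ones that pair a raw machine translation with a human correction of it.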

By default, the section contents are HTML. To get a plain-text version of each section instead, use the striphtml argument of the API.

To get only the source and user versions, use the types argument. Its default value is source|mt|user.
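A request combining these arguments might be built as sketched here. The names contenttranslationcorpora, striphtml, and types come from this page; the endpoint host, the `translationid` parameter name, and the value used to switch striphtml on are assumptions.

```python
from urllib.parse import urlencode

# Assumption: en.wikipedia.org is used only as an illustrative host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def corpora_url(translation_id, strip_html=False,
                types=("source", "mt", "user")):
    """Build a query URL for the contenttranslationcorpora list."""
    params = {
        "action": "query",
        "list": "contenttranslationcorpora",
        "format": "json",
        # Assumption: the translation id is passed as `translationid`.
        "translationid": str(translation_id),
        # Multiple values are joined with "|", as in the documented default.
        "types": "|".join(types),
    }
    if strip_html:
        # Assumption: a value of "true" enables plain-text output.
        params["striphtml"] = "true"
    return API_ENDPOINT + "?" + urlencode(params)


# 510 is the translationId of the example record; any id works the same way.
url = corpora_url(510, strip_html=True, types=("source", "user"))
print(url)
```

Note that urlencode percent-encodes the "|" separator as %7C, which is equivalent in a query string.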

Note: The output of this API is empty for translations published before 2016-01-22, because the API and the infrastructure it requires were only introduced on that date, and the parallel text of older translations was not captured. If you have a good aligner, you can still work with the actual article pairs from Wikipedia, using the output of the cxpublishedtranslations API.

Dumps

Because accessing translations one by one is inconvenient, translation dumps are provided in TMX and JSON formats. They can be downloaded here. Large language pairs have separate dump files. For example, cx-corpora.ca2es.text.tmx.gz is the parallel corpus dump for Catalan to Spanish translations. For smaller language pairs, a single file covering all languages is provided, named cx-corpora._2_.text.tmx.gz.
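A gzipped TMX dump can be read as a stream without unpacking it first. The sketch below assumes only the generic TMX layout (`<tu>` units containing `<tuv xml:lang="...">` variants with a `<seg>` each); any extra properties the real dumps carry are ignored. The sample document is invented, and the helper names are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def dump_file_name(source, target):
    """File name for a language pair, following the ca2es example."""
    return f"cx-corpora.{source}2{target}.text.tmx.gz"


def read_translation_units(fileobj):
    """Yield one dict per <tu>, mapping language code to a list of segments."""
    with gzip.open(fileobj, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag != "tu":
                continue
            unit = {}
            for tuv in elem.iter("tuv"):
                lang = tuv.get(XML_LANG) or tuv.get("lang") or "und"
                seg = tuv.find("seg")
                text = "".join(seg.itertext()) if seg is not None else ""
                unit.setdefault(lang, []).append(text)
            yield unit
            elem.clear()  # release the parsed unit; dumps can be large


# Invented sample, compressed in memory so the sketch runs without a download.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="ca"/>
  <body>
    <tu>
      <tuv xml:lang="ca"><seg>Gat</seg></tuv>
      <tuv xml:lang="es"><seg>Gato</seg></tuv>
    </tu>
  </body>
</tmx>
"""

units = list(read_translation_units(io.BytesIO(gzip.compress(SAMPLE_TMX))))
print(dump_file_name("ca", "es"))
print(units)
```

For a downloaded dump, pass the file path instead of the in-memory buffer; gzip.open accepts either. Segments are collected in lists because a unit may hold more than one variant for the same language.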

Note that dumps, unlike API data, only include translations in which editors made some changes (i.e., the "human" value is greater than 0). This is because the dumps focus on capturing the corrections users made to machine translation output, so that machine translation systems can be improved.
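The same criterion can be applied to records obtained from the API, for instance to work with a set comparable to the dumps. This is a minimal sketch of the filter as described here, not the actual dump-generation code; the records and their ids are invented.

```python
def has_human_edits(record):
    """True when the record's stats report a "human" value above 0."""
    return record.get("stats", {}).get("human", 0) > 0


# Invented records, reduced to the fields the filter needs.
records = [
    {"translationId": "1", "stats": {"human": 0.4, "mt": 0.5}},
    {"translationId": "2", "stats": {"human": 0, "mt": 0.9}},
    {"translationId": "3"},  # no stats at all: treated as no human edits
]

kept = [r["translationId"] for r in records if has_human_edits(r)]
print(kept)
```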

External repositories

Data about published translations has been integrated into the OPUS project. The goal of the OPUS project is to create an open parallel corpus by converting, aligning, and annotating free online data.

Analysis Examples

For an example of how to collect translation IDs from the cxpublishedtranslations API and link them to the parallel corpora, see: https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/content-translation-basics.ipynb