Help:Extension:Translate/Translation memory architecture/nl
This documentation explains the current architecture (as of October 2020) of the translation memory implemented in the Translate extension. The intended audience is developers and other people who are interested in, or tasked with, improving the translation memory architecture.
What is the translation memory
The simple goal is to speed up the translation process by suggesting similar, previously translated segments.
Database 1:
id | translations
--------------------------------------------------
1  | en: Are you a bunny rabbit?
   | fi: Oletko sinä pupujänis?
--------------------------------------------------
2  | en: Who are you?
   | fi: Kuka sinä olet?
If you are now translating the string "Are you a human?", the translation memory may return one or more suggestions. Usually a score is given indicating how closely each suggestion matches the original string, and this score is also used to select the best candidates.
Translation memory output 1:
Query | value: Are you a human?
      | source language: en
      | target language: fi
------------------------------------
Sug 1 | value: Oletko sinä pupujänis?
      | match: 80%
------------------------------------
Sug 2 | value: Kuka sinä olet?
      | match: 60%
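As a sketch of this interface, here is a toy in-memory translation memory in Python. The data mirrors Database 1 above; `SequenceMatcher` is only a stand-in for the real scoring function, and the `suggest` helper is hypothetical, not the extension's actual API.

```python
from difflib import SequenceMatcher

# Toy in-memory store mirroring Database 1; the real backend is ElasticSearch.
MEMORY = [
    {"en": "Are you a bunny rabbit?", "fi": "Oletko sinä pupujänis?"},
    {"en": "Who are you?", "fi": "Kuka sinä olet?"},
]

def suggest(text, source, target, threshold=0.5):
    """Return (translation, score) pairs sorted by descending score.

    SequenceMatcher.ratio() is a placeholder similarity measure, not
    the scoring function TTMServer actually uses."""
    results = []
    for entry in MEMORY:
        score = SequenceMatcher(None, text, entry[source]).ratio()
        if score >= threshold:
            results.append((entry[target], score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

An exact match scores 1.0 and sorts first; near matches below the threshold are dropped entirely.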
Why is a translation memory hard to implement
The most essential feature of a translation memory is the function to calculate the score (later: scoring function). The function should aim to find a balance of:
- finding the best suggestions as judged by the translators
- being performant, in the sense that it returns suggestions in a reasonable time and does not consume too many computing resources
Why is finding the best suggestions difficult
Both of these are difficult to achieve. Which suggestion is best is a subjective measure, and any kind of large-scale human evaluation is laborious to execute. As an illustration of the challenges here:
Database 2:
id | translations
--------------------------------------------------
1  | en: Where is the highway?
   | fi: Missä se maantie on?
--------------------------------------------------
2  | en: Germany has a number of old highway strips
   | fi: Saksassa on useita vanhoja lentokoneiden varalaskupaikkoja
--------------------------------------------------
3  | en: The old highway strip was 1111 meters long.
   | fi: Vanha lentokoneiden varalaskupaikka oli 1111 metriä pitkä.
Which one would be the best suggestion for "Where is the old highway strip?"? A naive system would return document #1, since it has four words in common. However, the translator can likely write the translation of "where is ___" very quickly, but may not be sure what the best translation for "highway strip" is. In this case documents #2 and #3 would have provided an answer to that question. Of these, #3 would potentially be a better match against the likely translation "Missä se vanha lentokoneiden varalaskupaikka on?", given that it has a long run of words that can be used directly, "vanha lentokoneiden varalaskupaikka" (apart from fixing the case of one letter), while document #2 would need multiple words corrected to be grammatically correct. Fixing them takes more time than removing and adding completely new words.
Why is it difficult to make a performant scoring function
Performance is determined by the complexity of the scoring function. A simple equality comparison using a (precomputable) hash is very fast, and it doesn't really matter how many times we run it. But something more useful, such as Levenshtein edit distance, takes (usually, depending on the implementation) polynomial time in the length of the strings being compared. When we are trying to calculate a score for a long paragraph, one scoring can take a good portion of a second, so it is not possible to do thousands of such slow comparisons.
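To make that cost concrete, here is a textbook dynamic-programming Levenshtein implementation (an illustration only; TTMServer's scoring runs inside ElasticSearch). Comparing strings of lengths n and m fills an n×m table, so doubling the paragraph length roughly quadruples the work.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance; O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```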
There are, of course, some optimizations, such as filtering out candidates that are too short or too long compared to the query string, meaning they would never meet the minimum score.
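The length filter can be sketched as follows. It relies on the fact that the edit distance between two strings is at least the difference of their lengths; the normalized score `1 - distance / max(length)` is an assumption for illustration, not necessarily the exact formula TTMServer uses.

```python
def could_reach_score(query: str, candidate: str, threshold: float) -> bool:
    """Cheap pre-filter: skip candidates whose length alone rules out
    reaching the minimum score, without computing any edit distance.

    Assumes the (illustrative) normalized score 1 - distance / max(len)."""
    longer = max(len(query), len(candidate))
    if longer == 0:
        return True
    # Edit distance is bounded below by the length difference.
    min_distance = abs(len(query) - len(candidate))
    return 1 - min_distance / longer >= threshold
```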
How Translate implements translation memory
The Translate extension has a component named TTMServer. Its etymology is possibly Translate (extension) Translation Memory Server. TTMServer is two things:
- Multilingual translation search
- Translation memory
These two features currently share the same data and backend, but they perform different types of queries against it. Translation search is relevant in the sense that we are working under the assumption that translation search and translation memory can continue sharing the same dataset, and that translation search functionality is not degraded by way of reduced features or reduced performance.
TTMServer provides an abstraction (a poor one --author) for the rest of the Translate extension to update the index, search it, or perform a translation memory query. In theory it can support multiple backends, but in practice there is only one.
There are two backends: ElasticSearch and Database. The Database backend does not support translation search and is provided only as a convenience for development environments. There used to be another backend for Solr, but it was removed because we did not have the resources to maintain a backend we do not use ourselves.
TTMServer storage schema
For this section, do familiarize yourself with what an index means in ElasticSearch, for example from https://www.elastic.co/what-is/elasticsearch.
The translations (and definitions) are stored in an index. The index makes no distinction between definitions and translations, so you can basically search from any language to any language, though it is not possible to have per-language analyzers with this schema.
The schema can handle multiple wikis sharing the same index, which is often desirable. The exception to this is private wikis, which should have their own private index.
Glossary
- Message: A piece of text with a source language and all its translations. In the case of this schema, a message can have multiple versions.
- Document: One version of the message in one language.
The schema has the following fields:
_id      | Also known as global id. It is a token in the format wikiId-localId-revisionId/languageCode.
content  | The text value of the definition or translation.
group    | Which message group this string belongs to. May not be globally unique.
language | Language code for the content.
localid  | Page title of the "message".
uri      | Link to the message.
wiki     | Wiki database identifier.
Here is an example entry:
_id      | translatewiki_net-bw_-Wikimedia:Xtools-projects-8849980/de
content  | Projekte
group    | xtools
language | de
localid  | Wikimedia:Xtools-projects
uri      | https://translatewiki.net/wiki/Wikimedia:Xtools-projects/de
wiki     | translatewiki_net-bw_
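For illustration, a global id in this format can be unpacked into its components. This parser is a sketch: it assumes the revision id is the last hyphen-separated segment and requires the wiki id to be known up front, since both the wiki id and the local id may themselves contain hyphens (as in the example above).

```python
def parse_global_id(global_id: str, wiki_id: str) -> dict:
    """Split a global id of the form wikiId-localId-revisionId/languageCode.

    The wiki id must be supplied by the caller because it and the local
    id can both contain hyphens, making a blind split ambiguous."""
    rest, _, language = global_id.rpartition("/")
    prefix = wiki_id + "-"
    if not rest.startswith(prefix):
        raise ValueError("global id does not belong to wiki " + wiki_id)
    local_and_rev = rest[len(prefix):]
    # The revision id is assumed to be the final hyphen-separated segment.
    local_id, _, revision = local_and_rev.rpartition("-")
    return {"wiki": wiki_id, "localid": local_id,
            "revision": int(revision), "language": language}
```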
How is the index updated
The index supports having multiple versions of the same message. Consider the following situation: a message is translated into many languages; then the definition is changed; translations are marked as outdated and are later updated.
When the definition is updated, we add a new document to the index. Since the revision number is different, it won't override the existing one. When the translations are updated, any previous versions of the translations for that message are deleted, and a new document is added for the latest version.
Fuzzy (outdated) translations are never inserted into the index.
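These update rules can be sketched in Python with a plain dict standing in for the ElasticSearch index. The helper names here are hypothetical, not the extension's actual API.

```python
# A dict stands in for the index: global id -> document content.
index = {}

def global_id(wiki, local_id, revision, language):
    return f"{wiki}-{local_id}-{revision}/{language}"

def on_definition_change(wiki, local_id, revision, language, content):
    # A changed definition gets a new revision id, so it is added as a
    # new document and does not override the previous version.
    index[global_id(wiki, local_id, revision, language)] = content

def on_translation_change(wiki, local_id, revision, language, content,
                          fuzzy=False):
    if fuzzy:
        # Fuzzy (outdated) translations are never inserted into the index.
        return
    # Delete any previous versions of this translation first.
    prefix = f"{wiki}-{local_id}-"
    suffix = f"/{language}"
    for gid in [g for g in index
                if g.startswith(prefix) and g.endswith(suffix)]:
        del index[gid]
    index[global_id(wiki, local_id, revision, language)] = content
```

After a definition change both revisions of the source document coexist, while each language keeps only its latest translation.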
How is the index queried
The current algorithm is complicated, but I'll try to explain it in parts. The high-level description is:
Inputs: text, sourceLanguage, targetLanguage, threshold
Algorithm:
- Matching Query: query the index where language is sourceLanguage and content "is similar to" text, ordered by the "scoring function", keeping documents whose score is higher than threshold. Return all matching documents.
- Construct searchTerms to retrieve the matching translations by replacing the language code in the _id with the target language.
- Retrieving Query: query the index where _id is any of searchTerms.
The definition of "is similar to" uses ElasticSearch's "fuzzy_like_this" query. Do note that this query is deprecated and has been removed in newer versions of ElasticSearch.
The definition of the "scoring function" uses levenshtein_distance_score to calculate the edit distance and uses that as the score.
An astute reader will already notice the major issue here: we run a slow matching and scoring query over a long list of documents, which can be pointless if those documents do not have any translations.
The current algorithm tries to work around this by doing the above algorithm twice. First it takes top 100 results from the Matching Query, hoping they are sufficient to return enough[1] results from the Retrieving Query. If this is not the case, it queries for 500 more (doing redundant work) and merges the results.
1. ↑ Having 6 or more unique suggestions
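The two-pass workaround can be sketched as follows. The function names are hypothetical, and the assumption that the second pass re-fetches the first 100 results (rather than only the next 500) is an interpretation of the "redundant work" remark above.

```python
def query_with_fallback(matching_query, retrieving_query):
    """matching_query(limit) returns the top `limit` matched source
    documents; retrieving_query(docs) returns their translations.

    Try the top 100 matches first; if they yield fewer than 6 unique
    suggestions, fetch 500 more (overlapping the first 100, hence the
    redundant work) and merge the results."""
    first = matching_query(100)
    suggestions = retrieving_query(first)
    if len(set(suggestions)) >= 6:
        return suggestions
    merged = retrieving_query(matching_query(600))
    # Merge while preserving order and dropping duplicates.
    seen, result = set(), []
    for s in suggestions + merged:
        if s not in seen:
            seen.add(s)
            result.append(s)
    return result
```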