Wikimedia Technical Conference/2018/Session notes/Improving the translation process
Theme | Defining our products, users and use cases |
Type | Evaluating Use Cases |
Session Leader | Santhosh Thottingal |
Facilitator | Leszek Manicki |
Scribe | Irene |
Description: Translation of content between language projects and localization of our products are important for helping new projects add content and for enabling collaboration between cultures. This session looks into the ways we accomplish this now and tries to identify our goals for the improvements we want to make to these processes.
Questions discussed
Question | Significance | Answers
---|---|---
How can localization practices be made consistent and well-integrated with translatewiki and our other language infrastructure? How do we normalize the method of handling multilingual content (different projects vs. single project)? How do we handle variants in a consistent, sustainable way? | This will identify the various translation workflows and the necessary elements in the architecture. translatewiki.net is a community-maintained project outside of Wikimedia infrastructure. This has many impacts (integration and security) that we should evaluate. | Also, does Product have any problems with translatewiki (outside tool, delay, security, etc.), as opposed to incorporating the tool into the wiki projects? Answer: Product doesn't have any big issues here, but:
How do we improve translations and the movement of content across languages? | This will address changes that need to be made to improve the translation workflow to enable better and faster translations. | Product decision to make: are we using content translation for forking content (one-time seeding), for regularly syncing translated content, or for some hybrid strategy?
What are the use cases of machine translation in our current and future projects? Can the machine translation service built for the Content Translation project be used for talk pages, image caption translation in Commons, updating existing articles, etc.? | Currently most communication on wikis is within a single language due to how projects are architected. If more collaboration is desired across languages, especially in non-language-specific projects, then we need to build tools that support communication, including machine-assisted translation of conversations. |
Important decisions to make
What are the most important decisions that need to be made regarding this topic?
1. Open question: Strategy: Is content translation about forking or syncing content across wikis? (see the answer to question #2 above)
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
Will the overuse of English as the source wiki cause digital colonisation (i.e. minimize the importance of other-language wikis)? | While many strategies look at translation as a method for filling content gaps, development cannot proceed coherently without a clear strategy. | Product, Technology
2. How comfortable are we with dependence on third-party proprietary MT?
Decision: From the discussions, we are not building our own MT engine, so we will use proprietary engines, but not a single provider, since many language pairs are not supported well enough by any single engine. Agreements must be made carefully. We are open to helping open-source MT engines with our corpora and grants. (A sketch of per-language-pair provider selection follows this table.)
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
More and more content depends on the good will / shared purposes of corporations. | A certain type of machine translation technology is still not good enough for multiple language pairs. | Technology and Product leadership
3. Translatewiki.net is critical infrastructure for WMF, but not part of our infrastructure. Sustainability is a question.
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
 | | Platform
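To illustrate the "multiple providers" direction in decision #2, here is a minimal sketch of choosing an MT provider per language pair with a fallback order. The provider names and the pair-to-provider preference map are illustrative assumptions, not the actual Content Translation configuration.

```python
from typing import Optional

# Illustrative only: the providers and the preference map below are assumptions,
# not the real Content Translation configuration.
PREFERRED = {
    ("en", "es"): ["Apertium", "Yandex"],
    ("en", "ko"): ["Yandex"],      # rule-based MT handles this pair poorly (per discussion)
    ("es", "ca"): ["Apertium"],    # closely related languages, rule-based MT works well
}
DEFAULT_ORDER = ["Yandex", "Apertium"]

def pick_provider(source: str, target: str, available: set) -> Optional[str]:
    """Return the first preferred provider for a language pair that is actually available."""
    for provider in PREFERRED.get((source, target), DEFAULT_ORDER):
        if provider in available:
            return provider
    return None  # no MT available for this pair; translators start from scratch

if __name__ == "__main__":
    print(pick_provider("en", "ko", {"Apertium", "Yandex"}))  # -> "Yandex"
    print(pick_provider("en", "ko", {"Apertium"}))            # -> None
```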
Action items
What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward.
1. There are use cases for the Content Translation service. Reading Infrastructure and Android to sync with Santhosh about use of the API in an upcoming Android feature.
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
The API is available and free for anybody to use. | There seems to be a lack of awareness. | Reading Infrastructure
2. Global templates need to be prioritized and productized. Alternatively, a semantic map of template parameters is required for translating content across languages.
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
Missing template support prevents adapting content from one language to another; many templates, such as infoboxes, hold very important data about an article. | Product prioritization and roadmap definition | Core Platform
3. The Translate extension needs technical maintenance addressing its technical and security issues (VE integration). Translatewiki.net's position as a separate entity needs to be examined to see how to support it better.
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
Page translation is used for very critical tasks inside the Foundation (such as strategy, fundraising, and policies); there is no reason to discard it. | Page translation not working with VisualEditor is a blocker for better editing, and localization updates are affected by the technical debt. | Product (Language)
New Questions
What new questions did you uncover while discussing this topic?
Is it a better strategy for us to develop our own translation engines and software, or to continue working with third parties and keep trading our data for their engines? And how do we maintain sustainable practices if working with proprietary systems?
Why is this important? | What is it blocking? | Who is responsible?
---|---|---
If we need to develop our own tools, it is a long process and we need to start building sooner rather than later. If we stay with third-party engines, we need to safeguard our interests so we are not left in the cold. | |
Detailed notes
Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.
- Background
- 4 types of translation we are doing -
- localization, translating interface messages
- translatewiki.net and apps
- Pet project of Niklas in 2006, high numbers of translators nowadays
- Used by phabricator
- Key to making our software available in all other languages
- Page translation - a kind of localization of front-facing banners, technical documents - better for literal translations, no nuance to the language
- Machine translation for translating articles
- Translation service, machine translation of content using html
- Using our own algorithms for this type
- "Smart" translation on top of
- Check-in of handout / background information
- Question: Is the "page translation" mentioned in the handout the operation behind the "translate" button one can see on wiki pages? - Yes
- Main questions of the session:
- How can localization practices be made consistent and well-integrated with translatewiki and our other language infrastructure? How do we normalize the method of handling multilingual content (different projects vs. single project)? How do we handle variants in a consistent sustainable way?
- It used to be a slow process, as an interface message took a week to appear on the wiki; localization updates now happen each night, which updates the localization cache
- Incident recently of malicious translation
- Fundamental problem is that the localization process needs some core updates -> how can we make this a seamless process from the developer to the user on the wiki
- Q from Corey for product people - should source code translations be managed on the front end?
- Is there a reason on principle for it being part of the core process? Josh says there is not a lot of interest in that; he wants to provide guidance to translate it more effectively. Example of Kiwix as having expectations of support beyond money and expertise; not a desire to bring the entire system in-house
- Security is a concern - this is done manually - perhaps the bots could be smarter to mitigate the security risk
- Looking at opening up the translation process to have more micro-translations; is also a massive undertaking
- Back to S - the code is out of date and from years ago
- People don't see it as a core or technical contribution, since translation is seen as a non-tech issue by the users
- There aren't active admins or a legal entity for translatewiki.net - personal liability is a concern
- Subbu: Conflict between these two points - quickness and safety; question about how much protection against vandalism there is; there are basic protections against vandalism
- Adam: Do we have adequate coverage across the languages we're trying to reach? Do we have sufficient interface translations, proportionate to the number of users of the language?
- Joaquim: Wikis fight vandalism by having fast edits; having a cache in the db where the live messages are kept on the fly
- People treat it like wiki-text where you can mix things; putting in labels telling people not to do this
- Page translation feature: it doesn't work as it should, because you are inserting markers and that's a horrible thing to work with (see the markup sketch after this discussion block); this is not prioritized at the moment. It is used for time-sensitive and important things, like fundraising banners.
- Subbu: To what extent can content translation be used for these things? Can you integrate machine translation into this?
- No freedom to skip sections of the page, all or nothing; big challenge is that the system has no way to identify these changes and keep them in sync
- Clarification of Subbu's question - can you use this as an API?
- Josh: if both of these are using [?] mark-up, there are ways of knowing beyond just string text comparison; this element has changed therefore this translation needs to change
- Fundamental problem is the marking system
- If we knew if we could fix this, we knew things would pick up, more people would be invested in helping
- J: we should get rid of this part of the translation system; it has software enforce something that can be enforced by humans, especially given the different structure of paragraphs and sentences between languages
- Action Item: move toward plain strings instead of html
- Action Item: Clean Up the Mark-Up on Import
- Action Item: Back-end security scrubbing (Javascript)
- Josh: product managers don't like this and it has failed to be prioritized at the executive level; we prioritize external impacts over internal. However, this affects participation, especially since movement strategy documents aren't being translated effectively due to this broken tool
- Machine translation is not "ours" but is coming from other 3rd-party groups; there are not many FOSS options
- It isn't sustainable (or how sustainable is it?) if we are relying on proprietary services. How do we solve this problem?
- Syncing is an issue that we don't have a solution for at the moment; it requires tooling that doesn't yet exist.
- Also there is the issue of only being able to translate to a language where the page doesn't exist; sometimes there is a mismatch where one language version is very full but another is a stub, and you cannot use the current tools to fix that
- Relying on local maps to anchor these translations, which is not effective on a large scale
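As referenced in the page-translation discussion above, the fragility comes from the unit markers that the Translate extension inserts into the page source. The following toy Python sketch (not the extension's actual parser) shows how a page is segmented into numbered units and why moving, merging, or dropping markers desynchronizes stored translations.

```python
import re

# Toy illustration of <!--T:n--> unit markers inside a <translate> block.
PAGE = """<translate>
<!--T:1-->
Donations keep the servers running.

<!--T:2-->
Please give today.
</translate>"""

def units(wikitext: str) -> dict:
    """Map each translation-unit id to its source text."""
    body = re.search(r"<translate>(.*)</translate>", wikitext, re.S).group(1)
    parts = re.split(r"<!--T:(\d+)-->", body)
    # re.split with a capturing group yields [prefix, id1, text1, id2, text2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}

print(units(PAGE))
# {'1': 'Donations keep the servers running.', '2': 'Please give today.'}
# Translations are stored per unit id; if an editor rewrites the page and the
# markers are moved, merged, or dropped, the id-to-text mapping (and therefore
# every stored translation) silently goes out of sync -- the fragility discussed above.
```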
- Discussion of article translation:
- M: For keeping it in sync, how about translating the diffs of the revisions?
- Subbu: is translation seen as a seeding tool or as keeping wikis in sync? Keeping them in sync doesn't seem sustainable.
- How important is machine translation for us? It is very important for the Arabic Wikipedia; machine translation has issues and security risks, but it is a good starting point, and we should work to improve it. On how sustainable the MT engines are: an idea is to start utilizing our own tools like Wikidata, which can give us accurate equivalences, as opposed to services like Yandex, especially for non-European languages; also look into the experience of other open-source translation interfaces. Move away from paragraphs and instead to sentences, as that highlights gaps that people can fill and can help with synchronizing translations.
- MT should support minority languages better. Wikidata can help with translating terms (see the Wikidata lookup sketch after this discussion block)
- Joaquim: depending on the article, translations are bad enough that it takes longer to fix the output than to just translate from scratch. Also consider term-level translation in the editor. We should be aware of digital colonialism; if we rely too heavily on translation we can be perpetuating that, with other wikis becoming "shallow copies" of dominant-culture wikis
- Jon: Question of forking vs keeping in sync; if you fork en masse, there are maintenance issues; but if you speed it up, there's a lot of content in a language that doesn't have the community to maintain it, so the updates would get lost; policing and updating is a hard burden to place on communities; also if we don't have a way to inter-relate with these 3rd-party systems we will get left behind
- Josh: maps of content changes, as opposed to trying to keep everything in sync or just allowing forking. "Flagging" of collections of articles across multiple languages as a possible solution.
- Q: is any of this heuristics work exposed through an API, or is it internal? (it's a service) Is it done by hand? (yes) Don't globalize templates but use a semantic directory -> we always have a machine map
- Corey: with the translation, we can have a UI that shows changes in other languages, and mixing stuff like the scoring systems into that UI
- Subbu: as a process question, there is an unresolved q of sync vs fork; is there a question we can use to frame this?
- S: This question needs clarification: is it only for new articles, or is it for syncing, or what?
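As referenced above, Wikidata can serve as a term-level translation source. Below is a minimal sketch assuming the standard Wikidata Action API (wbsearchentities to find an item by its label in the source language, wbgetentities to read its label in the target language); real use would need disambiguation and caching.

```python
from typing import Optional
import requests

API = "https://www.wikidata.org/w/api.php"
HEADERS = {"User-Agent": "translation-notes-example/0.1"}  # a descriptive UA is good practice

def wikidata_term(term: str, source_lang: str, target_lang: str) -> Optional[str]:
    """Look up a term's Wikidata item in one language and return its label in another."""
    search = requests.get(API, headers=HEADERS, params={
        "action": "wbsearchentities", "search": term,
        "language": source_lang, "type": "item", "format": "json",
    }).json().get("search", [])
    if not search:
        return None
    qid = search[0]["id"]  # naive: take the top match, no disambiguation
    entity = requests.get(API, headers=HEADERS, params={
        "action": "wbgetentities", "ids": qid,
        "props": "labels", "languages": target_lang, "format": "json",
    }).json()["entities"][qid]
    label = entity.get("labels", {}).get(target_lang)
    return label["value"] if label else None

if __name__ == "__main__":
    # e.g. translate the English term "infobox" into Catalan via its Wikidata item
    print(wikidata_term("infobox", "en", "ca"))
```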
- Discussion of machine translation as a service:
- Do we see a future where we are developing our own engine
- Now is the time to start thinking about having our own engine; unsure of the time and effort but we have the elements to start, an interface to see this, and other engines to model off of
- Donât want to get stuck in a dependency on other engines
- Maintenance is a big commitment as the languages evolve
- Something to use in our negotiations with google, providing them with a good text base and they pro
- Google was saying, "whatever data Wikipedia has, Google has more"
- Questions of how parallel things are as opposed
- Why not consider improving an existing engine instead of building from scratch? We are doing this; Joaquim has an offer of head-hunting for this
- Cheol: Apertium is a rule-based translator which is not good for translating from English to Korean. A statistical translation model or deep learning would be better for that language pair. Do we have more data beyond parallel corpora, such as online translators' activity logs? We can capture the translation process, such as online editing behavior: cursor movement, substituting terms, or hesitating to edit. Can we do a better job than Google using these data, or could we collaborate with a machine translation service provider on research or development with these assets?
- Corey: we are implicitly making the decision that we are not investing ourselves in a machine translation tool. Y/n? Either we're getting rid of it or we're working with an engine as part of a deal.
- V: getting good language pairs from one engine is hard (Josh: especially when there isn't a business incentive), better to use several. Build a plan for technology partnerships.
- We built an API service that provides translation for developers: translating talk page messages, translating Commons content, anything. What are more use cases? What are the high-priority cases? (A hedged example of calling such an API appears at the end of these notes.)
- The Android app wants to support caption translations in Commons and structured data
- We do know what languages people have expressed an interest in; we can recommend that new users or new positions invest in this translation process and enrich Commons, as they might not know about the API
- Action item: Reading Infrastructure will sync with Santhosh on this
- Action item: Term to term translation
- Q from V - how do we prioritize language pairs? Do we have a sense of that road-map?
- S: we do know about highly active language pairs, and tracking when people use language switcher tool; we do have that data (ex. Spanish and Catalan)
- V: a paper was published about this topic by Danny V, have we looked at this? Josh: this is the crazy solution, a wikidata-esque solution
- Subbu: language variance is part of a continuum, we should discuss this further
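As referenced in the API-service discussion above, other products (talk pages, Commons captions, the Android app) could call the Content Translation machine translation service. The sketch below is hedged: the endpoint shape, provider name, and response field are assumptions for illustration only, not the documented cxserver API.

```python
import requests

# Assumed URL shape for the Content Translation MT service (cxserver); consult the
# actual cxserver documentation before relying on this.
CXSERVER = "https://cxserver.wikimedia.org/v2/translate/{src}/{dst}/{provider}"

def translate_html(html: str, src: str, dst: str, provider: str = "Apertium") -> str:
    """Send an HTML fragment to the MT service and return the translated HTML."""
    response = requests.post(
        CXSERVER.format(src=src, dst=dst, provider=provider),
        data={"html": html},  # assumed payload format
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["contents"]  # assumed response field

if __name__ == "__main__":
    # e.g. a caption translation for a Commons file page
    print(translate_html("<p>A view of the old town at sunset.</p>", "en", "es"))
```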