Topic on Talk:Content translation

Deactivate until major bugs are fixed ?

NicoV (talkcontribs)

Could you temporarily deactivate the content translation tool until its major bugs are fixed?

It is currently damaging many articles on production wikis, and people other than the one using the tool often have to spend a lot of time fixing all the errors. Among the major bugs: many added attributes (id, data-source, data-cx-weight, contenteditable, ...), nowiki tags, internal links with no text at all, useless span tags, ...

An example of an article that requires a lot of effort to fix after CT has added so many errors: https://en.wikipedia.org/w/index.php?title=CJ_Entus&action=edit&oldid=674317311
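
To make the list above easier to check on other articles, here is a minimal sketch of how such residue could be counted in saved wikitext. The pattern names, the function `find_ct_residue`, and the assumption that a link with no text shows up as `[[Target|]]` are all hypothetical; this is not how the tool or the abuse filter actually detects these problems, and the noisy `id` attribute is deliberately left out.

```python
import re

# Hypothetical patterns written from the bug list in this post;
# they are illustrative, not the checks used by any real filter.
CT_RESIDUE_PATTERNS = {
    # Editing attributes that should never reach saved wikitext
    "cx_attribute": re.compile(
        r"\b(?:data-source|data-cx-weight|contenteditable)\s*=", re.I
    ),
    # Opening or self-closing nowiki tags
    "nowiki_tag": re.compile(r"<nowiki\s*/?>", re.I),
    # Piped internal links whose label is empty (assumed form: [[Target|]])
    "empty_link": re.compile(r"\[\[[^\[\]|]*\|\s*\]\]"),
    # Opening span tags
    "span_tag": re.compile(r"<span\b", re.I),
}


def find_ct_residue(wikitext):
    """Return a dict mapping each issue name to its number of matches (zero counts omitted)."""
    counts = {}
    for name, pattern in CT_RESIDUE_PATTERNS.items():
        matches = len(pattern.findall(wikitext))
        if matches:
            counts[name] = matches
    return counts
```

A page that returns an empty dict is not necessarily clean; the sketch only covers the four patterns listed.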

Pginer-WMF (talkcontribs)

We have identified these kinds of issues as critical bugs with high priority. The team is working on them: the most critical ones will be fixed as soon as this Thursday, and we'll keep working on the rest. Given the current volume of published articles, we do not think it is necessary to deactivate the tool at the moment.

We are sorry to hear that the tool is causing extra work, and we appreciate your report since specific examples of these issues are really useful to quickly identify and solve them. 

NicoV (talkcontribs)

This tool has been creating a lot of extra work since its release. Why does WMF insist on leaving broken tools active on production wikis? What would be the problem with activating this tool only for a short period of time to identify the critical bugs, and then waiting until they are resolved before activating it again?

On frwiki, for example, about half the edits made with this tool create unnecessary nowiki tags, and that is only one example among the critical bugs. Do you really think it's normal to keep it active when the percentage of problems is so high? Some of the critical bugs were reported more than 3 months ago (T96242, T96467), which means a lot of extra work for volunteers. In my view, users shouldn't even have to ask for this tool to be deactivated until it is fixed; it should be a natural reflex for the WMF team to avoid damaging articles on production wikis.

Magioladitis (talkcontribs)

I confirm that after the activation of this tool, the number of pages that need syntax fixes/simplification has increased.

Pginer-WMF (talkcontribs)

These and similar issues have been discussed before as important bugs to fix. As part of a conversation with the Catalan community, some of them were identified as must-fix issues before taking the tool out of beta, but this is the first time they have been presented as urgent enough for the tool to be disabled even as a beta feature. So I'll try to provide my point of view and ask for some additional information that can help us better understand the impact of those problems.

Our understanding so far has been that, most of the time, the users who created the translation were the ones who had to fix the issues in the markup the tool produced. This means that if they are still using the tool to create those articles, it is because the benefits of the tool outweigh the usual glitches of a tool in active development.


Another important consideration is that Content Translation is mainly used to create new content that did not exist before (the tool discourages overwriting existing articles). So no harm is expected for existing articles. Please let us know if you find otherwise.


In addition, we expect that articles that don't meet the quality standards of a wiki will be deleted. We are tracking deletion rates as a way to estimate the number of low-quality articles, and they are quite low (about 8%). The problem you are raising (articles with content good enough to keep but with markup that requires effort to fix) is harder to track; is there a way we can estimate the number of articles that create heavy cleanup work for other editors?


Wherever we see inappropriate content added to the generated wikitext, we fix it quickly and deploy the fixes every week. So the tool is rapidly improving in that respect. For example, we are deploying https://gerrit.wikimedia.org/r/229313, https://gerrit.wikimedia.org/r/229106, https://gerrit.wikimedia.org/r/228242 and https://gerrit.wikimedia.org/r/228773 this Thursday to address these kinds of issues.


We estimated the general overhead for communities to be low, since the tool is exposed to few users through the beta feature system (and those tend to be experienced users). On Catalan Wikipedia (the wiki with the most articles created with Content Translation), 15 new articles are translated per day. That wiki has about 100 new articles per day, so Content Translation represents a small fraction of the articles produced, which we assumed would not be a heavy workload (considering that only a percentage of those may have issues not fixed by the article creator). On English Wikipedia, Content Translation is generating 10 articles per day (out of about 1000 new articles), with many more active editors.


The tool is already useful for many users, and great articles have been created with it. Thus, if we disable the tool, we will prevent these users from creating great content with it too. Interrupting the service would be a really major measure, and while we do not want the tool to cause disruptions like those pointed out in your comments, we would like to work on cleaning up these issues as a priority as they are discovered. For that we depend on feedback from the community, such as the feedback you provided, and we appreciate it.


We have been trying to interact as much as possible to compile examples of the issues that have an impact (good and bad), and it is important that we get a better picture of how they affect different people in the different communities so that we can identify next steps. Feel free to use this discussion page or Phabricator to add further details on these kinds of issues.

NicoV (talkcontribs)

Well, it seems I don't share your impression.

My experience on frwiki is that users who use CT are often not the ones who fix the article, even if the tool is only for beta testers. For example, I checked the last 5 edits that triggered the nowiki abuse filter (and I only needed to check the last 6 articles edited with CT):

  • CornerShot (nowiki, useless span tags, internal links with no text, templates incorrectly transformed to HTML, ...) : not fixed by the translator
  • The Beatles 1966-1970 (nowiki, internal links with no text, templates incorrectly transformed to HTML) : not fixed by the translator
  • Dryopteris intermedia (nowiki, internal links not expanding on all the word, CT attributes, templates incorrectly transformed to HTML) : not fixed by the translator
  • Aéroport international de Paphos (nowiki, span, useless prefix for templates) : almost fixed by the translator, but he missed a span
  • Aéroport de Barnaoul (nowiki, span, useless prefix for templates, CT attributes, ...) : fixed by the translator (same translator as above)

The 6th edit, Parc Provincial du Lac Chan, which did not trigger the nowiki abuse filter, also contains problems that have not been fixed by the translator: missing refs, and br tags inside external or internal links.

So on the last 6 articles edited with CT, all of them were problematic...

Also, are you sure that the tool is only accessible through the beta feature system? For example, there was recently a wide mailing by a WMF researcher suggesting translations to contributors, with direct links to the CT tool. Had all the recipients activated the beta feature?

I don't deny that the tool may be useful, but disabling the tool until it is fixed is not "preventing users from creating great content with it"; it's just waiting a bit for the tool to be useful without the damage that currently occurs. And for me, damage to new articles or to existing articles is still damage that needs to be fixed and that lowers the quality of the encyclopedia. For the six examples given above, there is no reason to delete them, because there is actual content, just poor formatting, so it's clear that basing your statistics on article deletion won't warn you about this.

Your claim that you are quickly fixing problems is false: a lot of them were reported more than 3 months ago and they are still present. If the fixes were really quick (1 week), I wouldn't be asking for deactivation. It's because I still see the exact same problems more than 3 months after reporting them that I ask for deactivation. Quick math: about 10 articles with problems a day for 3 months means about 900 articles with problems created...
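
The quick math above can be written out as a small sketch. The figure of 10 articles per day is the rough estimate from this post, and treating a month as 30 days is an assumption made only for the illustration; the function name is hypothetical.

```python
def estimated_damaged_articles(articles_per_day, months, days_per_month=30):
    """Rough estimate of articles created with problems over a period.

    days_per_month=30 is an assumption for illustration, not a measured value.
    """
    return articles_per_day * months * days_per_month


# About 10 problematic articles a day over 3 months
total = estimated_damaged_articles(articles_per_day=10, months=3)
print(total)
```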

Runab WMF (talkcontribs)

Hello @NicoV, thanks for your feedback and for your patience in responding to us so clearly. We can confirm that the tool is not available to users who haven't enabled it in their preferences, and that it is not available at all to users who are not logged in. The emails that a number of editors received were part of a project the Research team is running to develop a way to suggest articles that editors may be interested in.

Of the 2 tickets you mentioned earlier, the <nowiki> issue was definitely resolved, but very recently a regression was spotted in Parsoid that caused this problem to affect multiple services again. Content Translation is one of them. Of the other issues reported in the second ticket, some have been resolved by fixes made this week, which should be available from tonight onwards. I can find one other issue reported by you which is still open (related to templates). Are there any others that we are missing? Thanks.

NicoV (talkcontribs)

@Runab WMF

I will try to check CT edits this weekend to see what has really been solved, but honestly, taking some time off to properly fix the most prominent bugs would be much appreciated. It's tiring to spend time (sometimes it's a lot of work just to fix one article, given the extent of the problems) fixing problems that could easily be avoided by taking a cautious approach (activate for a short period of time to discover problems, then deactivate until the problems are fixed, then repeat).

So, you mean that all the problems listed in my previous post are fixed by tonight's deployment?

Including: nowiki, span tags, internal links with no text, internal links not expanding to the entire word, CT attributes all over the place, missing references, br tags inside links, ...

And that template problems may still be present: templates incorrectly transformed into HTML, template prefixes

Runab WMF (talkcontribs)

Hello @NicoV, a summary of the changes that were implemented last night and early this morning is here: https://cxupdate.wordpress.com/2015/08/07/2015-08-06/ . Some of the tag-related errors have been fixed. The nowiki problem is in progress. We are working with the Parsing team on this and hope to have it solved by next week. In the meantime, we are testing to find things we may have missed, and it would be really great if you could let us know when you see anything breaking that should not break. The list of articles you provided earlier was very helpful for us. Thanks.

NicoV (talkcontribs)

I just checked the last edits made with CT on frwiki. It seems to have improved, but there is still a high percentage of edits that result in things that shouldn't be in the article:

  • Very often:
    • unnecessary "Modèle:" when using templates
  • Gerold Huber:
    • strange succession of quotes : 10 successive single quotes
    • unnecessary span tags
  • Rio Mendihuaca:
    • punctuation included in internal links: [[Parc national naturel de Tayrona|Parc national naturel Tayrona.]]
    • strange div tags around references block, which is also strange as it is an opening tag and then a closing tag
  • Werner Gura:
    • unnecessary span tags
    • italics around whitespace
    • italics/bold including whitespace after word
    • strange id in blockquotes
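
For the "punctuation included in internal links" item above, a cleanup could look roughly like the following sketch. The regular expression and the function name `move_punctuation_out_of_links` are hypothetical, and the approach is naive: it would also move the final dot out of a label that legitimately ends with punctuation (an abbreviation, for instance), so the results would need review.

```python
import re

# Hypothetical pattern: a piped link whose label ends with punctuation.
# Group 1 = target, group 2 = label without the trailing punctuation,
# group 3 = the trailing punctuation itself.
LINK_WITH_TRAILING_PUNCT = re.compile(
    r"\[\[([^\[\]|]+)\|([^\[\]|]+?)([.,;:!?]+)\]\]"
)


def move_punctuation_out_of_links(wikitext):
    """Move punctuation at the end of a piped link label to just after the link."""
    return LINK_WITH_TRAILING_PUNCT.sub(r"[[\1|\2]]\3", wikitext)


fixed = move_punctuation_out_of_links(
    "[[Parc national naturel de Tayrona|Parc national naturel Tayrona.]]"
)
print(fixed)
```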

So I maintain my suggestion to deactivate until bugs are fixed.

And suggestions for a cleaner wikitext:

  • Put a blank line before the DEFAULTSORT + categories block
NicoV (talkcontribs)

@Runab WMF

I checked all the edits made yesterday on frwiki again: of the 5 edits, all 5 have problems. 100% of edits having problems is a very high percentage. Don't you think the bugs should be fixed before the tool goes on creating damaged articles?

  • Helena Kozlova: unnecessary blockquote tags (with strange id), italics around just punctuation, punctuation included in internal links, references tag put inside div and using strange syntax <reference></references> instead of <references/>
  • Joe Belfiore: unnecessary div tags
  • Aéroport de Kirov: unnecessary span tags, span tags with CX attributes (cx-segment, data-segmentid), prefix in templates call
  • PANH: wikitext seems OK, but the result shows calls to non-existent templates
  • Aéroport de Bratsk: unnecessary span tags, span tags with CX attributes (cx-segment, data-segmentid), prefix in templates call
Magioladitis (talkcontribs)

What about deactivating it for 3 weeks until we have some progress? The VisualEditor team says they have now fixed the span problems using a new version of Parsoid. This is a step forward. I think we could ask the same of Content Translation.

I do not expect all major bugs to get fixed, but I think there is now enough feedback for the programmers to work off-site for a month until some things are fixed. The CT programmers work fast, and 3 weeks is a fair amount of time to have many bugs fixed and new code deployed.

Yesterday, another user contacted me to complain that the text they were writing in Content Translation was lost. There are a lot of problems.

Moreover, some of us work on syntax fixes, and these days we are getting more workload than we can handle. Content Translation and VE are giving us nightmares at the moment.

Let's resume from September.

Magioladitis (talkcontribs)

@Pginer-WMF I am not sure that only new articles are created. Here I see content added to an already existing article:

https://en.wikipedia.org/w/index.php?title=Andrea_Cedr%C3%B3n&type=revision&diff=674875564&oldid=671804063

Santhosh.thottingal (talkcontribs)

@Magioladitis, Content Translation allows translating when the target article already exists. The translation will be an overwrite. Translators should do this carefully and only when it is really necessary. A typical use case is expanding stub articles. When doing this, Content Translation warns the translator three times: (1) when choosing an existing title as the target title, before starting the translation; (2) when the translator enters the translation tool with the source and target titles selected; (3) when publishing, Content Translation informs the user about the overwrite and asks whether the translator wants to publish under the User namespace. If a translator overrides all three warnings, for good or bad reasons, an overwrite happens.

Magioladitis (talkcontribs)

Btw, one more example that supports my argument about experienced editors..

https://en.wikipedia.org/w/index.php?title=Alfred_Berengena&oldid=674966784

"Alfred began to touch the battery to a very early age"...
