For realizing Parsoid. (= If those pages like Performance be translatable, I will jump on and put them into ja (; |
User talk:SSastry (WMF)
The Technical Barnstar | |
All your work on Parsoid. <3 DAlangi (WMF) (talk) 20:41, 19 August 2022 (UTC) |
Hear, hear!
Thanks! :)
Re , would it be okay if I added {{nobots}} to that page, or would it interfere with your tests? Are there other pages that should be tagged?
Sure. I mostly removed it because the url was fictitious and it didn't matter what it was.
Hi! Now that Extension:TemplateData loads the list of used hooks from the extension.json file, did it become clear that the extension has been using an undocumented hook for almost half a year now. You added ParserFetchTemplateData
in 956367fb9071 in last November; could you please create its documentation at Manual:Hooks/ParserFetchTemplateData? Thanks in advance!
We are still not sure if we want to support that officially. It is currently a hack for now and it may go away. So, we didn't want to give it more first-class status till it is clear what will happen with it. But if it really matters, we can document it and note that it is not meant to be used by anyone outside Parsoid since it might go away.
I see. I’m not sure what the best solution in this case is; probably going back to the previous status quo and manually providing the hook list manually on the extension page as template parameters (with a wikitext comment explaining the situation)? I’d really like to see an explanatory comment in the extension.json file as well—lack of comments is exactly why I hate the hype of having JSON as a configuration file format, it’s simply not suitable for that purpose. :(
@Tacsipacsi: The fact that the hook is undocumented does not mean that it does not exist, and the extension does not use it. If the hook has no documentation, then Manual:Hooks/ParserFetchTemplateData should be created with {{Undocumented MediaWiki Hook}}, rather than including a list of hooks in Wikitext which is doomed to fall out of date. (And I know that will happen because it did; I converted the page to load hooks from the repo as part of a batch cleanup of Category:ResourceLoaderTestModules extensions after phab:T232875
You definitely get a kitten too per Topic:Ugc7j7k7aoum5jfq
:-) Thanks!
Tidy bugs without Remex assigned -- I don't think any of those need to go anywhere. The rest are tagged with Remex where they impact the current cleaner either by my testing or my guessing ;). Verify you like all those tasks there. phab:T67747 is probably the only one you might have an opinion on given your other close today.
Otherwise, I think the project has been Tidyed nicely.
Thank you very much for getting into the weeds and tidying it all up. :-) I'll take a look and we'll fix things as appropriate.
Hi SSastry,
I'm trying to fix the lint errors at tr.wiki. Normally, after fixing an error on a page: 1) that page would disappear from the list (e.g. from "Special:LintErrors/multi-colon-escape"), 2) clicking "page information" would no longer show that lint error, and 3) lintHint would no longer show that lint error. But today, after fixing a lint error, 1) & 2) don't happen. Examples:
Do you know what's going on here? -- Curious (talk) 19:39, 3 March 2018 (UTC)
Thanks for flagging this. Others have raised this as well Topic:U8p5sxysk9eblcob and Topic:U8q07xrh4r67iql4. Something might be temporarily broken with the update mechanism -- normally, it takes a little while for updates to propagate, but 24 hours is too long. We'll investigate this tomorrow.
This page would have been listed in the ru:Special:LintErrors/multiple-unclosed-formatting-tags category. It won't show up any longer now that someone has fixed it.
A little bit of something else. Is this normal when wrapping spreads even outside the parent block? The tag wrapping starts inside the ul, but does not end when the ul tag is closed. Continues to wrap everything up to the end of the page. Previously with Remex, it was not like this.
In the example, it affects the text "after"
before *a *<small>b *c *d *e after
Also, the unclosed tag S in the middle of the discussion page in the old parser maybe not line-through all other topics and text to the end of the page. Probably it should at least ever stop line-through or it is normal now?
Yes, this behavior is part of the HTML5 parsing spec. But, note that even Tidy did the same thing -- see below.
<ul> <li>a</li> <li><small>b</small></li> <li><small>c</small></li> </ul> <p><small>after</small></p>
If you replace the <small> tag with <s> tag, the same thing happens. What is different is when there are multiple such unclosed tags. In Tidy, the effects are limited or it assumes the second unclosed tag was a closing tag. With Remex (like any HTML5 parser), the effects accumulate.
Thanks for addressing the template parameter order dirty diffs! (From Phabricator T179259.)
I was looking at your Gerrit patch for it. If we have parameters A through Z, it looks like inserting B into XYZAC aborts as soon as it sees X, yielding BXYZAC instead of the more natural XYZABC. I think we can do a better job of keeping related parameters grouped together with a modest increase in effort.
Based on the variables i and j in your code, walk through all values of j and remember the j with the smallest absolute_value(i-j). If no such j exists, append the new parameter. If j>i then insert before j, otherwise we have j<i and we insert after j.
Inserting B into XYZAC would lock onto A or C, and put B between them.
I just read the Gerrit replies and saw you mentioned this there. Cool :)
I also saw the comment about it being O(N^2). I'm pretty sure C. Scott Ananian's idea of using standard quicksort with a custom sort function will return garbage. The sort function generally assume the compare function is transitive. That's not true for the compare function he provided. Existing parameters could get scrambled when they get sorted around the new parameters. The lack of transitivity could result in a complete nonsense result.
If it's worth the effort to go for an O(N LogN) algorithm, I think I see how to do it. However I'm not sure which argList functions are O(1) or O(N), which is important. I'm currently trying to Google that.
You're right, the algorithm as I sketched it does not provide a transitive comparison function.
But here's a version which is O(N ln N), using an initial linear pass over the templatedata arguments to do the equivalent of the "search and insert":
https://jsfiddle.net/ovwpwkr3/
It doesn't require the sort to be stable.
Not to be the grumpy old man, but as a software engineer with almost 40 years' experience, let me recommend that you start with an obvious, easy-to-verify n^2 approach, and think about clever n log n approaches only later IF performance turns out to matter, which it very well might not, and anyway you can use the n^2 implementation to test the clever implementation.
I agree that fussing over n^2 efficiency for small n is probably pointless. And the most important thing is fixing the reordering of existing parameters.
However the current question is whether to fuss over the results when adding new parameters to existing parameters. The WMF is using a nice approach for simplifying the code by working at a higher level of abstraction. They basically just sort a list rather than fiddling with parameters individually. It's short, understandable, and it cleanly sweeps messy details under the rug. (Duplicated parameters can be an annoying corner case.) However it disregards the relationship between parameters added at the same time. If the existing parameters are ZA and you add MN, you get the ugly order NZAM. N has a weak pairing with Z, M has a weak pairing with A, and the strong MN pair was ignored and split.
It's possible to do better, but I hesitate to advocate an ugly solution. The most straightforward approach appears worse than N^2, and the efficient approach would be a hard-to-decipher algorithm. The current code solves the main problem, and generally gives good results when adding new parameters.
...and https://jsfiddle.net/ovwpwkr3/1/ handles arguments further from their "anchor" a little better.
@Cscott I did a bit of testing. Assuming the existing parameters are ZA and assuming a full alphabet in template data:
var orig = "ZA"; var tmpd = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
The first version of the code produces ZABY (should be YZAB), and the second version produces MZAN (should be any of MNZA, ZMNA, or ZAMN).
The more apparent approach is to place the most closely-adjacent parameters first, and progressively add parameters. However it may be more big-O efficient to take an opposite approach. Consider parameters in a chain, in templatedata order. Cut the chain between existing parameters, leaving each new parameter attached to exactly one existing parameter. o = new parameter, * = existing parameter.
o-*-*-o-o-*-o
Walking down the chain from left to right:
- There is nothing to do until we reach the *-*. Simply split at that point:
- o-* *-o-o-*-o
- As we walk from the second * to the third * we remember the step with the largest templatedata distance between parameters, then cut at the biggest gap. For example:
- o-* * o-o-*-o
- There are now three chains, each containing exactly one existing parameter. Put the chains in proper order for the exiting-parameters.
Thanks @Alsee and @EEng for your suggestions and @Cscott for your updated patch. Please see my latest comment on the gerrit patch. I personally prefer iterating towards the complexity that is needed by starting with something simple that is acceptable.
@Alsee take a look at https://gerrit.wikimedia.org/r/389852. I'm pretty happy with MZAN as the output. You're always going to have some weird corner cases, but "sort with closest original parameter" is simple and understandable.
Hi, I see you're listed as Parasoid lead.
Recently I've submitted quite a few bug reports, and some WMF staff have expressed puzzlement at why I have such a miserable Flow experience and why I run into a steady stream of problems, up to and including Unable to parse content due to a Parsoid failure. I think I've figured out a big issue. I was looking at Parsoid/Round-trip_testing. It looks like the WMF attempted to collect a large body of representative test data. parsoid-tests currently claims:
- 100% parsed without errors,
- 99.84% round-tripped without semantic differences, and
- 74.56% round-tripped with no character differences at all.
But if your body of test data isn't actually representative of real world use, you're going to miss a lot of issues that rarely or never appear in the data you did test against.
The majority of edits are not saves.... the majority of edits are actually done within an edit-preview-cycle, possibly ending with a single save. I can tell you, you're going to find a *lot* of stuff that shows up within the edit-preview-cycle that rarely shows up in a final save. It seems you also didn't test against edits outside of article space, so there's another big chunk of stuff you're missing. (In Flow, merely clicking between the two editing modes generates a round-trip which routinely mangles what I had... and some of those no-semantic-difference changes are gross).
Is there any chance you can collect a large body of preview-diffs to do a round-trip test against? Preferably one test-set from article space, and one test-set sampled from all non-article space?
@Alsee Thanks for reaching out directly about this and your suggestions about possible gaps in our test coverage.
As of July 2015, the round trip testing tests against 160K pages coming from a collection of 30 wikis. enwiki has the largest representation @30K pages. In addition, 30% of the 160K pages were picked from the RC stream, so they are pages being edited. So yes, you are right that we did attempt to collect a large body of test data that also attempted to be representative.
I think you are making the observation that for Flow, this is not sufficient. For starters, I want to look at the bug reports you've filed so I know what is going on. It would be helpful if you point me to that. It is not necessary, but steps to reproduce them would be even better. Once I have a chance to look at that, it would help us figure out how to improve our test coverage (including but not limited to looking at non-article space wikitext usage).
My other reply got mangled. I'll leave that post as-is, in case you want to investigate the mess. Flow is not reliable communication medium for discussing Flow bugs because it doesn't have a stable TrueText like wikitext does, and because trying to show troublesome Flow content generally triggers problems. I'll do my best build a non-damaged version of my original post here. Oh jeeze, I copy-pasted it all here and immediately got those little arrows, and a round-trip immediately resulted in new semantic changes. For example the text used to contain the word *was* and a roundtrip turned it into bulletpoint was*
Original post below:
----------------------------------------------------------
I'll probably split separate subjects into separate replies.
I think you are making the observation that for Flow, this is not sufficient.
Hmm. My first thought was that this issue would spill over to article editing too.... but on second thought maybe not much. I don't really use Visual Editor and I'm still trying to think through the actual-use implications. I think the fact that VisualEditor can't round-trip edit modes during an edit will mostly avoid the nasty trigger situations. But there are definitely latent issues you're not seeing, and I think real-use article-editing errors rates will still be a bit higher than your testing suggests.
The great thing about real wikitext is that newbies can type all sorts of malformed crap into it.... experienced editors can type a malformed half-edit... anyone can accidentally or experimentally type all sorts of malformed crap into it.... and it cheerfully gives you a reasonable best-effort render. No errors, no complaints, the worst that happens is you don't get what you hoped for. Then you play around with your wikitext until you do get what you want.... and you save a single relatively-clean generally-well-formed edit.
In Flow it's common to repeatedly flip edit modes while working up a single edit. That can hit Parasoid with all sorts of random malformed wikitext. And the worst part isn't a rare Parsoid Parse error. The worst part isn't even a fluke semantic change. The worst part is that Parasoid constantly throws your wikitext out the window. Your wikitext is gone, and Parasoid gives you a new mutated blob of wikitext.
Just a single <nowiki>
Becomes <span><nowiki></span>
which becomes <span><nowiki></span>
I lost the edit that triggered the Parasoid Failure. I had reflexively tried to make italics while in visual mode and I got something like this:
<nowiki>'blah' blah blah ''</nowiki>blah blah
The weird placement of the first nowiki tag caught my eye, and I made some random minor edit to see what would happen. I might have unbalanced the nowikis. I toggled the edit mode through 4 or 5 round-trips, it mutated each time. On the final roundtrip Parasoid itself mutated it into something that triggered a Parasoid failure.
(Then I started talking about bugs in that post.)
Ok, I avoided any roundtripping on this post. What I see on my screen right now looks correct. If I *don't* add a new reply, it's because this save looks correct.
You can now remove that other mangled post. I know what is going on. I filed https://phabricator.wikimedia.org/T115236 to figure this out -- you uncovered a bug in Flow <-> Parsoid interaction. Once we figure this out, the weird normalizations you are seeing should go away.
The post above came out correctly.
I can report a reproducible semantic-change test case:
- Starting with Flow in default Visual mode.
- Go to the damaged post below(permalink). Click edit.
- CTRL-A, CTRL-C to copy everything.
- Click on a REPLY link somewhere, and CTRL-V paste. (You should still be in Visual mode)
- Notice that there are no bullet points anywhere.
- Click for Wikitext mode. Click back to Visual mode.
- Notice that there is now a bulletpoint.
The arrow lines that you see (and reported) is a Firefox specific issue with copy-paste in Visual Editor (Visual Editor is embedded inside Flow). I cannot right away find the bug report for you. But, in any case, the scenario of the bullet being introduced is a bug in Parsoid (that should never happen -- we should have added a nowiki around the bullet because of the line-breaks introduced by the copy-paste. We'll investigate how to reproduce this and fix this.
I just found this on Jimbo's talk page:
There are still lots of improvements yet to make, not least integrating wikitext and visual editing together properly and removing the hack of having a second edit tab that jumbles up the interface... Best, -Elitre (WMF) (talk)
It sounds like the plan is for general editing to work like Flow editing, where you get rid of the real wikitext parser and run everything through Parasoid. If so, then I need to revise my comment above, where I indicated that this issue isn't going to massively spill over to article editing. If you try integrating wikitext and visual editing together properly by running everything through Parasoid, three things will happen:
- It's going to start massively mangling articles the same way it mangles Flow messages. Trying to deal with typo-ridden fragmentary incomplete edits is going to hammer all of Parsoid's failure cases.
- Experienced editors are going to scream bloody murder. Even when there's no-semantic-changes, it's WTF why does it constantly destroy what I wrote.
- Newbies will be severely impacted and probably driven off, when Parsoid makes it impossible to figure out how anything works. Parsoid throws away the wikitext they entered, and gives them back new and different wikitext. They will have no clue why their wikitext changed. The re-written wikitext may use stuff they haven't figured out yet. For newbies trying to learn and experiment, a simple preview test turns into a WTF-just-happened-I-can't-understand-this-I-quit.
Note that this bug you noticed here is specific to Flow. https://phabricator.wikimedia.org/T115236 does not affect Visual Editor. In addition, there are other techniques that Parsoid uses for Visual Editor that prevents dirty diffs. Between VE and Parsoid, we will nevertheless be doing a lot of testing to make sure something like this does not happen, especially since you have specifically alerted us to this possibility.
I reported copy-paste completely mangling everything. I assume it will get fixed.
I reported reverting-an-edit completely destroying the original message. I assume it will get fixed.
I reported that the page you see on first save doesn't match what everyone sees on future fresh loads. I assume it will get fixed.
I reported that the editing interface crapped out, making editing virtually impossible. I assume it will get fixed.
I reported on summarize-edits causing history links to generate fatal exceptions. I assume it will get fixed.
I reported on all of the menu items becoming inaccessible if the content in the post is "too big". I assume it will get fixed.
I reported that oversight was never implemented. (Because my private info got sucked into a copy-paste test post.) I assume it will get fixed.
I reported test-results that mixing top-posting and bottom-posting turns large discussions into incomprehensible spaghetti. I *hope* there's a plan to fix it.
I can't even begin to remember everything I've reported - every time I touch flow everything explodes faster than I can report it all.
When you say "making sure something like this does not happen", are you merely saying you plan to get the mangling rate down to the figures claimed for Visual Editor? Or are you saying it's actually going to work as well as our current wikitext editor? (i.e. evil gremlins never-ever-ever creep in and rewrite what we wrote?) Because as I understand it, there's not much prospect that will ever get fixed with Parasoid.
I can only tell you what we'll do on the Parsoid end, not about Flow.
As for "making sure something like this does not happen", I cannot promise "never-ever-ever". Bidirectional conversion between wikitext and HTML is hard, and we have come a long way there. The problems you saw in Flow posts that frustrated you is because of a bug (which I linked to above) where some information is not getting to Parsoid. But, I appreciate your confidence in our abilities :-), but let us wait till we get there perhaps before making final judgements. But, more seriously, I do appreciate your reports as they pertain to Parsoid.
This post can now be deleted if you wish.
I just realized, we both got so distracted by bug-flood that neither of us noticed that you didn't address the semantic-change bug that completely destroyed the post. The one that led me to make a second post. The one where Flow randomly changed it's mind about how to parse the nowikis and changed it's mind about having a line in italics. The nowikis were stable through roundtrips... until it wasn't.
I did, but in the generic thread above. TL:DR; is that I filed T115236
Ok. The Phab only mentioned the no-semantic-change rewriting of the wikitext view. I now take it that the Phab is intended to cover the changing rendered-view as well.
I hid it, but my understanding is that the WMF rejected my report that Flow fails to support deleting.