Jump to content

Topic on Help talk:Extension:ParserFunctions

Suggestion to state that {{#if: can check for multiple parameters

14
Snowyamur9889 (talkcontribs)

I discovered a while back through testing on another hosted wiki (MediaWiki 1.39.3, running PHP 8.1.19 (fpm-fcgi) ) that {{#if: can check for multiple parameters as the test string.


For example:

{{#if:{{{Param1|}}} {{{Param2|}}} {{{Param3|}}}|Param1, Param2, or Param3 exist|No parameter strings exist}}

{{{Param1|}}} is checked for as the test string, but {{{Param2|}}} and {{{Param3|}}} can also be checked. If at least one of the parameters exist and has a non-empty string argument, the string "Param1, Param2, or Param3 exist" will be displayed. Otherwise, "No parameter strings exist" will be displayed.


I think this note should be mentioned under the "if" section of this documentation. It's not explicitly mentioned you can do this, but the fact you can makes {{#if: significantly more useful for checking for multiple parameters in the test string.

RobinHood70 (talkcontribs)

Even faster for checking that sort of thing is: {{#if:{{{Param1|{{{Param2|{{{Param3|}}}}}}}}}...etc., though it has the disadvantage of being harder to read. The idea there is that if Param1 has a value, neither of the other two parameters needs to be evaluated.

Verdy p (talkcontribs)

That's not correct. If Param1 is explicitly given an empty value, the result of #if: will be always false, independantly of values given or not given to other parameters (that are used to set a default value for Param1, only if Param1 is not passed at all). As well, this is absolutely not faster. The expression given as a default value of a missing parameter is still evaluated before that parameter is checked, and the recursive inclusion of triple-braces is very unfriendly (avoid it completely, it is very errorprone).


Currently Mediawiki still does not perform "lazy evaluation" for expanding templates and parameters, until their values are effectively used and needed to be partially expanded (and finally fully expanded only at end of parsing).

A true lazy expansion would mean that each node in the parser is in either two state: unexpanded and unparsed text, or expanded and cached value of its expansion (the cache would keep the first unexpanded form as the key: if that key is not found in the cache, then the string is still not expanded and still needs to be parsed; nodes would point then either on an unexpanded text, or on a cache entry containing the <unexpanded,expanded> pair of texts; using a cache would allow avoiding unnecessary conditional expansions, and reuse of results of prior evaluation of the same unexpanded text).

MediaWiki currently uses a cache only for transclusions of templates (after parsing and resolving the template name, and its parameters ordered and named in canonical form), but I'm not even sure it uses any cache for parser function calls, so that they could return different values (e.g. for an "enumerator" parser function, or a parser function return different random numbers. This is alerady false for parser functions that return different time parts, which are all based from the same (cached) local time on the server: once you extract any part of that local time, the resulting parsed page will have its stored cache expiration adjusted/reduced to match the precision of the datepart requeste, but this does not affect how the current page. For "lazy parsing", I was speaking about a transient cache used exclusively in memory by the parser (which is not stored in the DB), and does not survive the full parsing and expansion of the current page.

RobinHood70 (talkcontribs)

I hadn't considered empty values. My mistake, and to be honest, that on its own probably makes the rest of this moot. But leaving that aside for the moment, though, as I recall, the parser will tokenize the values to PPNodes no matter what. So, all three get tokenized regardless of whether they're nested or sequential. During expansion, however, I thought the loop put the resultant value in the named/numbered expansion cache as soon as it found the relevant value and didn't fully expand the default values if it didn't need to (i.e., if one of the values was non-blank/non-whitespace). Am I wrong in that? It's been a while since I've looked at that code. I'll grant that even if it does so, the performance gain is minimal, but it's not nothing.

As far as recursive braces go, the only concern I'm aware of is brace expansion which, of course, is ambiguous if you have 5+ of them. Knowing that the preprocessor bases its decisions on the opening braces, unambiguous parsing is as simple as making sure that your braces use spaces as needed: {{{ {{, for example. Is there something I'm not thinking of that complicates triple-brace usage further? Or did I misunderstand your point?

I'm not sure if I've understood your last paragraph correctly, so I don't know if this is a helpful reply, but I believe parser functions are only cached in the sense that the entire page is cached. If you don't set the cache expiry in your PF, a so-called "random" number generator will happily return the same value on every view/refresh until edited or purged. If the PF does set it to a lower value, the page gets re-evaluated every X seconds. To the best of my knowledge, that means that something like a random number generator would cause the entire page to be re-parsed, most likely at every refresh unless they use a cache-friendlier value.

Verdy p (talkcontribs)

You did not understand: Any parser function that is supposed to return a different random number each time it is called from the same rendered page will still not return different numbers: the invocation is made once and cached (in memory only, not stored). What is cached AND stored in the DB is the result of the full page parsing (that's where you need a "purge/refresh", but "purge/refresh" has no effect on the memory cache for multiple invocations of the same parser function all, or module call, which cannot return different values between each call; but may return different values only when refreshing the whole page)

In summary, MediaWiki uses purely functional programming, invocations can be executed in any order, there should be no "side effect" with hidden "state variables". This also allows Mediawiki to delegate part of the work to multiple parallel workers (if supported), running without any kind of synchronization, and reuse their results (synchronization would occur only on the in-memory cache). If we have hidden state variables that are mutable and that influence the result, this gives a strong performance penalty, forcing the evaluation to be purely sequential, and not allowing a completely "lazy" evaluation with all its advantages in terms of performance.

RobinHood70 (talkcontribs)

If you re-read, I actually did say the same thing as your first paragraph. That's what I was getting at when I said that a random number generator that doesn't set a cache-expiry would only be re-evaluated when the cache is invalidated. If the PF does set the cache-expiry, however, it's quite possible to create a random number generator that changes at every refresh, for example, this implementation of #rand. If you edit the page, you'll see a cachetime parameter which you can adjust. The corresponding value will be set in the page's NewPP limit report if you look at the page source, and you can also confirm the time the page was actually cached. (PS, feel free to edit that page if you want to confirm what I'm saying...that's a development wiki, so nobody will mind.)

Are you sure that invocations can be executed in any order? I was under the impression that MediaWiki's parsing was still linear and invocations come in a fixed order (except maybe if you're using the Visual Editor, since that uses Parsoid). That's been a point of some contention, since allowing invocations to come in any order breaks extensions that rely on preservation of state during parsing, like Extension:Variables and those that do other forms of variable assignment (e.g., loading variable values from a database). I know they talked about breaking that in 1.35, but I saw no mention of it in the change list for that or any version thereafter and I was under the impression that it was only Parsoid that did that, not the still-current preprocessor/parser. I honestly haven't had time to look into it thoroughly.

Verdy p (talkcontribs)

Execution order in MediaWiki was made so that it is not significant; this will enve be more critical for Wikifunctions, which must be purely function, and that will run in random order, over possibly lot of backend servers running asynchronously (the first that replies fills the cache so that further invokation is avoided). Lazy evaluation is critical for Wikifunctions to succeed. Various things have been fixed in MediaWiki to make sure that it can behave as a purely functional language, allowing parallelization. There are still things to do and there a some third-part Mediawiki extensions that still depend on sequential evaluation. For a full lazy evaluation it should not be needed to fully process arguments, except as needed from the head (optionally from the tail): this requires virtuallization of string values, so that we don't need the full expansion of the wikicode, but instead can parse it lazily and partially from left-to-right, just as needed to take the correct decision in branches, and then eliminate unparsed parts so that we don't even need to evaluate them.

This is possible in PHP, as well as in Lua, or even in Javascript, by using a string interface and avoiding using functions like "strlen" as much as possible, that requires a full expansion of its parameter string (for example if we use a code that matches only prefixes or suffixes, we jsut need to expand as many initial or final nodes as needed). Internally the "string-like" object would actually be stored as a tree of nodes, as long as they are not expanded, or as a flat list of string nodes (not necessarily concatenated to avoid costly copies and reallocations) if they are expanded. Even the TidyHTML step is not required to take a single physical string argument, where it can as well read characters from a flat list of string nodes (many of them being possibly identical and not takign extra memory, except for the node itself in the tree or list, represented in an integer-indexed table. This means that nowhere in the parsing, expansion and generation of the HTML we could really need to have the whole page in memory in large string buffers, and we can parallelize all steps of parsing/expansion/tidyHTML so that they behave as though they were operating linearily, and incrementally. This would boost the performance, even on a single server thread.

If we really use lazy evaluation, the number of nodes to evaluate would be very frequently reduced (notably for wiki pages that expand a lot of shared templates or parser functions with conditional results (#ifxxx, #switch) or using only part of their parameter (such as "substring" functions from the start or end of the text). The in-memory cache for lazy evaluation can be the tree of nodes itself, whose evaluation and expansion is partial, and that can also be valuated using delegates running asynchronously in parallel, possibly on multiple server/evaluator instances.

RobinHood70 (talkcontribs)

Thanks for all that. In what version of MW is linear parsing officially broken? I thought that was all being done in Parsoid and that we were safe from that kind of change until then, but from what you're saying, it sounds like even the legacy parser is being affected. That completely messes us up, as 75% of our wiki uses a custom extension that relies on linear parsing and state being maintained within any given frame (not to mention we have cross-frame data as well). The idea of it is that it can modify or create variables as well as returning them or inheriting them across frames, not to mention loading data from other pages. The dev team have promised us a linear parsing model as well, but there's been no information anywhere that I've found, so we're really in a holding pattern until we know what's going on.

Is this all documented anywhere? It would be nice to be able to keep abreast of these changes, but I haven't found anything that even remotely touches on these changes the way you just did.

RobinHood70 (talkcontribs)

Oh and apologies to Snowyamur9889 for this getting so far afield.

Till Kraemer (talkcontribs)

Thank you! I was looking exactly for this. Would be great to have it in the documentation. Cheers and all the best!

DocWatson42 (talkcontribs)
Verdy p (talkcontribs)

You're not even required to use space separators between parameter names in the #if: condition. This works only because if all parameters are empty or just whitespaces (SPACE, TAB, CR, LF), you get a string of whitespaces in the 1st parameter, and "#if:" discards leading and trailing whitespaces (but not whitespaces in the middle) from all its parameters. Note that "#if:" also discards HTML comments everywhere from all its parameters (so as well you can freely insert whitespaces or HTML comments just after "#if: ", or around pipes or before the closing double-draces, and you get the same result).

However "#if:" does not discard leading or trailing whitespaces if they are HTML-encoded, e.g. as &#32;, and for that last string, if it is used in the "#if:" condition, it will evaluate as as false condition. So "{{#if: &#32; | true | &#32;false&#32; }}" returns "#&32;false&32;": the expansion of HTML-encoded character entities into " false " is not performed during the evaluation of parser functions or template expansion, but only on the final phase that generates the HTML from the fully expanded wikitext, and cleans it with HTMLTidy (which may compress whitespaces and whick may then move leading and trailing whitespaces in the content of an element outside of that element, where they may be further compressed, except if the element is "preformated" (like HTML "<pre> </pre>" which is left intact; note also that this HTMLTidy step may reencode some characters using numerical character entities, in hexadecimal, decimal or using predefined named entities like "lt", "gt" or "quot", if this is needed to preserve a valid HTML syntax in the content of text elements or the value of element attributes; which reencoding is used exactly at this step does not matter, as they are fully equivalent in HTML and does not affect the generated HTML DOM, and this encoding is not detectable at all in templates or parserfuntions, or in any client-side javascript). Note that "pre" elements are treated much like "nowiki", so that its content in hidden in a uniq tag and not parsed, but regenerated from the "uniqtag" cache that stores its actual value after the TidyHTML step (so its inner whitespaces in the content are left intact, they are just "HTMLized" using HTML-encoding with character entities as needed).

Note also that if there's any "nowiki" pseudo-element in the condition string of "#if", it will always evaluate this condition to false, even if that "nowiki" pseudo element is completely empty. E.g. "{{#if: <nowiki/> | true | false }}" returns "true". Effectively "nowiki" elements are replaced during the early parsing by some "uniq tag" (starting a special character forbidden in HTML and containing a numeric identifier for the content); and are replaced by the actual content at end of template expansion of parser function calls, but just before the HTMLTidy step which may strip part of the content if it starts or ends by whitespaces.

RobinHood70 (talkcontribs)

I just tried {{#if: <nowiki/> | true | false }} on both my testing wiki and WP to be sure, and as I'd thought, it returns "true", not "false", though your reasoning is pretty much correct, otherwise. It sees <nowiki/> as a non-empty value because of the uniq tags that you mention.

Verdy p (talkcontribs)

You're correct (that's what I described, but I made a bad copy-paste from the former code just above). I just fixed my comment above.

Reply to "Suggestion to state that {{#if: can check for multiple parameters"