User:GWicke/Parsoid source ranges

From mediawiki.org

Parsoid annotates tokens with two kinds of source ranges in their dataAttribs property:

tsr
The token source range: the source range of that particular token alone (often a single tag).
bsr
The block source range for the top-level block as the tokenizer sees it, set on the first token of the block. A page is modeled as a sequence of top-level blocks.
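A minimal sketch of how these annotations might look, assuming a source range is a half-open pair of character offsets `(start, end)` into the wikitext. The `Token` class, its `kind` and `name` fields, the `slice_range` helper, and the sample offsets are hypothetical illustrations; only the names `dataAttribs`, `tsr`, and `bsr` come from the description above.

```python
from dataclasses import dataclass, field

# Hypothetical source text: two top-level blocks separated by a blank line.
SOURCE = "<b>bold</b>\n\n<i>it</i>"


@dataclass
class Token:
    """Illustrative token; only the dataAttribs keys 'tsr' and 'bsr' come from the text."""
    kind: str                      # "start", "end" or "text" (assumed categories)
    name: str                      # tag name, or "" for text
    dataAttribs: dict = field(default_factory=dict)


def slice_range(source, rng):
    """Return the source text covered by a half-open (start, end) range."""
    start, end = rng
    return source[start:end]


# Tokens for the first block. 'bsr' is set only on the block's first token.
tokens = [
    Token("start", "b", {"tsr": (0, 3), "bsr": (0, 11)}),
    Token("text", "", {"tsr": (3, 7)}),
    Token("end", "b", {"tsr": (7, 11)}),
]

# Copying each token's source range reproduces the original text exactly.
round_tripped = "".join(slice_range(SOURCE, t.dataAttribs["tsr"]) for t in tokens)

# The block range on the first token covers the whole top-level block.
block_text = slice_range(SOURCE, tokens[0].dataAttribs["bsr"])
```

In this sketch, slicing by `tsr` recovers each token's original text, and the `bsr` of the first token spans the same text as all three token ranges together.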

The token source range is directly useful for the lossless re-serialization of individual unmodified tokens. The DOM tree builder discards attributes on end tags, so the source range of an end tag is normally lost when a DOM is built from tokens. The source range of a given subtree can still be established approximately from the (start-)tag source ranges of sibling DOM subtrees.
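The sibling-based approximation can be sketched as follows. This is an illustration under stated assumptions, not Parsoid's implementation: the `Node` class, the `start_tsr` field, the `approx_subtree_range` function, and the `parent_end` fallback are all hypothetical, and ranges are again taken to be half-open `(start, end)` offsets.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical source with two sibling paragraphs.
SOURCE = "<p>a</p><p>bb</p>"


@dataclass
class Node:
    """Illustrative DOM node; 'start_tsr' stands in for the start tag's tsr."""
    name: str
    start_tsr: Optional[Tuple[int, int]] = None   # end-tag ranges are assumed lost
    children: List["Node"] = field(default_factory=list)


def approx_subtree_range(parent, index, parent_end):
    """Approximate the source range of parent.children[index].

    The subtree starts where its own start tag starts and is assumed to end
    where the next sibling with a known start tag begins. For the last child,
    fall back to parent_end, a caller-supplied upper bound.
    """
    node = parent.children[index]
    if node.start_tsr is None:
        return None
    start = node.start_tsr[0]
    for sibling in parent.children[index + 1:]:
        if sibling.start_tsr is not None:
            return (start, sibling.start_tsr[0])
    return (start, parent_end)


root = Node("body", children=[
    Node("p", start_tsr=(0, 3)),
    Node("p", start_tsr=(8, 11)),
])

first = approx_subtree_range(root, 0, len(SOURCE))
second = approx_subtree_range(root, 1, len(SOURCE))
```

The result is only approximate: any source text between the real end of a subtree and the start of the next sibling (whitespace, for example) is attributed to the preceding subtree.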

A possible strategy to preserve attributes from end tags using empty ('self-closing') elements is described in Parsoid/Todo#DOM_tree_builder.

For well-formed content, the block source range directly provides the start and end offsets of an entire DOM tree. Unfortunately, well-formedness cannot currently be assumed: the tokenizer only tokenizes HTML tags (it does not build a tree from them), and templates can inject arbitrary tokens. The value of the block source range is therefore doubtful right now, but it might become more useful later for content that is known to be well-formed.
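One way the well-formedness condition might be checked is a simple stack-based balance test over a block's tag tokens. This is a hypothetical sketch: the `(kind, name)` tuple representation and the `is_well_formed` function are assumptions made for illustration, and the check ignores details such as void elements and implied end tags.

```python
def is_well_formed(tags):
    """Check that (kind, name) tag tokens nest properly, using a stack."""
    stack = []
    for kind, name in tags:
        if kind == "start":
            stack.append(name)
        elif kind == "end":
            # An end tag must close the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
    # Every opened tag must have been closed.
    return not stack


balanced = [("start", "div"), ("start", "b"), ("end", "b"), ("end", "div")]

# A template might inject a stray end tag (hypothetical example).
unbalanced = [("start", "div"), ("end", "b"), ("end", "div")]
```

Under this sketch, a block source range would only be trusted for token sequences like `balanced`, where the tags nest properly.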