User:TBurmeister (WMF)/Measuring page length

This page describes a learning journey and analysis project around the not-so-simple question: "how can I measure the amount of text on a wiki page?". It summarizes:

Available data sources for page size, page length, number of characters, words, sentences, and sections, focused on Wikimedia technical wikis
What data each source offers and how data varies across sources
Potential issues with using those available data sources for measuring the page length and content complexity of technical documentation. Highlights of where available data does or doesn't align with my human perception of page length
Analysis of which types of tech docs and elements of tech docs content are most likely to be impacted by these issues
Ideas for what to do about it!

Motivation and context

My goal is to assess the readability and usability of our technical documentation, as part of the Doc metrics project. I'm interested in the number of characters and words on a page, and in the structure of the page (number of sections), because those content attributes are strong indicators of readability and quality^[1]. Most standards for assessing web content usability use the number of characters of visible text on a page to assess the content^[2], since this is what the user experiences.

In trying to define a scoring rubric for a "page length" metric (e.g., if Page_length < 20k bytes then 1, else 0), I realized I had a misconception that "page size" in bytes would reflect the number of characters and the amount of content on the page, thus aligning with how a human reader would experience it. Even though I knew that pages are stored in wikitext and rendered as HTML in many varied ways^[3], I hadn't considered that this makes page size in bytes an unreliable way to measure page length, or the amount of content on a page, for tech docs metrics.

Even when I had this realization, I still thought there must be a data source out there that could give me character or word counts for the rendered HTML. Thus began my quest.

Questions to answer

Which data sources provide data for page size or page length?
Do the available data sources align or diverge in their calculations of page size / page length?
How do different types of page content and formatting influence page size and page length measurements? Are these interactions consistent or unpredictable? Do they have particular impact for technical documentation, or for certain types of tech docs?
Do available data sources reflect how a human would assess the length of a page? Does the data (whether bytes, number of sections, number of characters, or whatever) actually correlate with whether a page is "text-heavy" or "very long"?

Data sources

I used the following data sources and tools in this analysis:

MediaWiki Page Information
XTools - specifically XTools/Page History
prosesize
Expresso - limited to plain text under 5000 words.
HTML page contents retrieved via https://www.mediawiki.org/api/rest_v1/#/Page%20content/get_page_html__title_, then copied and pasted as unformatted text into LibreOffice, to use the software's built-in Word Count function.
- For LibreOffice-based "all chars" or "all words" data, I removed only the languages menu and navigation menu content. The character counts exclude spaces.
- For LibreOffice-based "prose chars" or "prose words" data, I removed all code samples, any content that looked like code (e.g. example CSS or CLI commands), and reference tables that didn't contain full sentences or prose-like content. This was not an exact science. The character counts exclude spaces.
Printable version of the page, generated by clicking the MediaWiki "Printable version" link when logged in (so using the same skin/user preferences) and with the same browser settings each time.
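The manual LibreOffice step for "all chars" and "all words" can be roughly approximated in code. The following is a minimal sketch, not the method used in this analysis: it assumes the page HTML has already been retrieved (for example from the REST endpoint listed above), it uses only the Python standard library, and the names VisibleTextExtractor and count_visible_text are hypothetical. It does not remove navigation or language menus, so its numbers would not match the LibreOffice figures exactly.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping elements whose contents are never displayed."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_visible_text(html):
    """Return word and character counts (spaces excluded) for an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks from merging into one word.
    # Assumption: a word split by an inline tag (rare) is counted as two words.
    words = " ".join(parser.chunks).split()
    return {
        "words": len(words),
        "chars_no_spaces": sum(len(word) for word in words),
    }
```

Because code samples and table cells are counted like any other text, this corresponds to the "all chars" style of measurement rather than to any "prose" measurement.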

I reviewed but did not use these additional data sources or calculation methods:

Wikistats - see Research:Wikistats_metrics/Bytes
Using mwparserfromhtml and its methods like .get_plaintext(), as demonstrated in https://gitlab.wikimedia.org/repos/research/html-dumps/-/blob/main/docs/tutorials/example_notebook.ipynb (ironically, this might be the best solution to all the issues outlined here >_<)
Directly processing raw wikitext using built-in string methods, as in example 4 in https://public-paws.wmcloud.org/User:Tizianopiccardi/ICWSM_tutorial-content.ipynb

Data elements

Page bytes

This is the standard method of measuring page size in most MediaWiki built-in metrics tools and the corresponding database tables. Consequently, you can access page size in bytes through a variety of different tools and dashboards:

MediaWiki Page Information
Action API module API:Info
- According to the API:Info notes section, the built-in MediaWiki Page Information tool uses a separate module, but "much of the information it returns overlaps the API:Info module".

XTools - XTools/Page History contains page size in bytes (the same as all above sources), and also "prose size" in bytes (see below).
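As an illustration of the API:Info route, here is a minimal sketch of requesting page size in bytes through the Action API. It assumes the mediawiki.org endpoint https://www.mediawiki.org/w/api.php and formatversion=2 output, in which each page object carries a "length" field. The function names and the User-Agent string are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the Action API endpoint for mediawiki.org.
API_URL = "https://www.mediawiki.org/w/api.php"


def build_info_url(title, api_url=API_URL):
    """Build an action=query&prop=info request URL for a single page title."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return f"{api_url}?{urlencode(params)}"


def extract_page_bytes(response):
    """Return the page size in bytes from a parsed response, or None if absent."""
    page = response["query"]["pages"][0]
    return page.get("length")  # missing pages have no "length" field


def fetch_page_bytes(title):
    """Perform the network request; requires internet access."""
    request = Request(
        build_info_url(title),
        headers={"User-Agent": "page-length-sketch/0.1 (hypothetical example)"},
    )
    with urlopen(request) as reply:
        return extract_page_bytes(json.load(reply))
```

Calling fetch_page_bytes("Manual:Hooks") would return the current byte size of that page, which may differ from the figures recorded in this analysis as the page is edited.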

Prose measurements: bytes, characters, words

Prose measurements quantify only the subset of content on a page that a given tool considers to be "prose". Each tool I used has its own calculation, and their conclusions about "what is prose" vary slightly:

XTools - XTools/Page History displays "prose size" in bytes, along with the number of characters and words in the page sections the tool considers to be "prose".
- The algorithm used to calculate prose was "inspired by toolforge:prosesize", but it appears to not yield exactly the same results. See explanation in https://blog.legoktm.com/2023/02/25/measuring-the-length-of-wikipedia-articles.html
- Definition of prose excludes templates, and content created by TemplateStyles, Math and Cite^[4].
prosesize:
- Size (in bytes) of the text within readable prose sections on the page (calculated via Rust's string len() method, which returns a length in bytes).^[5]^[6]
- Based on Parsoid HTML for each wiki page.
- "counts the text within <p> tags in the HTML source of the document, which corresponds almost exactly to the definition of "readable prose". This method is not perfect, however, and may include text which isn't prose, or exclude text which is (e.g. in {{cquote}}, or prose written in bullet-point form)."
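To make the "text within <p> tags" idea concrete, here is a rough sketch using only the Python standard library. It is not the prosesize or XTools implementation: those tools apply additional exclusions (for example for templates, TemplateStyles, Math and Cite) that this sketch ignores. The names ParagraphTextExtractor and prose_stats are hypothetical, and the sketch assumes well-formed HTML in which every <p> has a closing tag.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect the text of each <p> element, ignoring everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_paragraph = False
        self._current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._current))
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._current.append(data)


def prose_stats(html):
    """Return byte, character, and word counts for paragraph text only."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "prose_bytes": sum(len(p.encode("utf-8")) for p in parser.paragraphs),
        "prose_chars": sum(len(p) for p in parser.paragraphs),
        "prose_words": sum(len(p.split()) for p in parser.paragraphs),
    }
```

Text in lists, tables, and code blocks never reaches the counts, which is exactly the behavior that makes prose measurements a poor fit for pages built mostly from those elements.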

I also calculated prose based on my own human analysis, because I wanted to capture "all the text on the page that isn't menus, code, or reference tables". Because this was a quick and dirty experiment, I did it manually using LibreOffice and my little human brain (see Data sources above for details).

All characters and words

These data points came from LibreOffice, generated as described in #Data_sources above. I was not able to find any easier-to-use data source that would give me a character and word count for all the text in all content sections of a rendered wiki page. Perhaps mwparserfromhtml would be the key to this, but I didn't have time to dig into that.

For samples under 5000 words, I pasted the plain text from LibreOffice into Expresso, which has a different counting algorithm, so its numbers vary slightly from those of LibreOffice.

Sections

As reported in XTools/Page History. Section count includes the lead section, so all pages will have at least one section.
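For illustration only, a section count with the same "lead section counts as one" convention could be approximated from rendered HTML as sketched below. How XTools actually computes its section count is not described here, so this is an assumption: the sketch simply counts heading elements (h1 through h6) and adds one for the lead. The names HeadingCounter and count_sections are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCounter(HTMLParser):
    """Count heading elements in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.headings += 1


def count_sections(html):
    """Return the number of headings plus one for the lead section."""
    counter = HeadingCounter()
    counter.feed(html)
    counter.close()
    return counter.headings + 1
```

If the HTML being measured includes the page title as an h1 (as a full skin-rendered page does), that heading would need to be excluded before counting.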

Sentences

For samples under 5000 words, I pasted the plain text from LibreOffice into Expresso, which can calculate sentences and other more detailed text metrics.

Pages

Based on MediaWiki "printable version". See #Data_sources above. Admittedly a very subjective piece of data, but I needed some way to capture the relative differences in how much scrolling one would have to do in order to go through an entire page.

Methodology

I did this manually because I wanted to understand the nuances of what was going on, and make conclusions about the data by comparing it with my human perceptions of page content as I did the work.

Selected 20 technical documentation pages of varying lengths and formats, from several different collections, and representing multiple doc types (reference, tutorial, landing page, etc.).
For each page, I gathered the following data, if available, from each data source:
- Bytes: size of the page as reported by MediaWiki via Page Information (or XTools/Page History, which uses the same data)
- Prose measurements:
  - Prose bytes: size of the subset of page content that a given tool considers to be "prose"
  - Prose chars: number of characters in the subset of page content that a given tool considers to be "prose"
  - Prose words: word count in the subset of page content that a given tool considers to be "prose"
- Chars with all page contents included: based on LibreOffice measurements of unformatted HTML page contents, with nav menus removed.
- Words with all page contents included: based on LibreOffice measurements of unformatted HTML page contents, with nav menus removed.
- Sentences: only accessible through Expresso, so limited to content under 5,000 words.
- Sections: as reported by XTools/Page History.
- Print pages: number of pages in the PDF document generated by clicking the MediaWiki "Printable version" link when viewing a page.
- Human page length rating: a rough estimate based on opening the page, scrolling to the bottom, and giving a rating based on a scale of small - medium - large - very large - epic. I didn't attempt to align these values with any page size ranges; my goal was to record a gut reaction of how I perceived the page size.
- Doc type: a rough assessment of the type of content on the page, based on standard document types for technical docs.
Based on that data, I calculated the following:
- Prose bytes vs. page bytes: A ratio calculated by dividing prose bytes by page bytes, using data from XTools/Page History. Captures what percentage of the bytes on a page are "prose".
- Prose chars vs. LibreOffice prose chars: A ratio calculated by dividing prose chars value (from XTools/Page History) by the character count of prose as calculated by LibreOffice. Captures the difference between what tools like XTools and prosesize count as "prose" vs. what a human (me) considers to be prose.
- Prose chars vs. LibreOffice all chars: A ratio calculated by dividing prose chars value (from XTools/Page History) by the character count of all text content rendered on the page, excluding menus, calculated by LibreOffice. Captures the difference between the "prose" and the actual amount of content rendered on the page.
- LibreOffice all chars vs. page bytes: A ratio calculated by dividing the character count of all text content rendered on the page, excluding menus, calculated by LibreOffice, by the page size in bytes, as reported by MediaWiki via Page Information (or XTools/Page History, which uses the same data).
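Each of the calculated values above is a simple quotient. A minimal sketch of the arithmetic, with a hypothetical helper name and rounding to three decimal places as an assumption chosen to match how the ratios are displayed in the findings tables:

```python
def ratio(numerator, denominator):
    """Divide two measurements and round to three decimal places."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(numerator / denominator, 3)


# "LibreOffice all chars vs. page bytes" for Manual:Hooks, using the
# character count (71139) and byte size (177849) from the findings table.
hooks_ratio = ratio(71139, 177849)  # 0.4
```

The same helper applies to the other three ratios by swapping in prose bytes, prose chars, or the LibreOffice prose character count as numerator or denominator.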

Findings

You can view all my raw data and calculations in this Google Spreadsheet.

Page size in bytes does not reflect actual page length


Page	LibreOffice char count	Bytes	LibreOffice all chars vs. page bytes	Print pages	Human page length rating	Doc type
Manual:Hooks	71139	177849	0.400	23	Epic	How-to and reference
Manual:FAQ	45260	89127	0.508	38	Epic	FAQ
ResourceLoader/Migration_guide_(users)	48430	79408	0.610	20	Long	Reference
API:REST_API/Reference	50816	65205	0.779	59	Epic	API docs; Reference; code samples

Pages that use code samples or tables, like reference pages, will naturally be longer to the human eye due to how code is formatted and distributed across lines. In general, it's acceptable for reference docs to be long, because we assume that users will employ a targeted search or skimming approach to quickly locate the information they need. So, I didn't want to compare reference content with non-reference content.

However, even when I inspected non-reference pages, I found surprises in how rendered page length correlates with page bytes:


Page	LibreOffice char count	Bytes	LibreOffice all chars vs. page bytes	Print pages	Human page length rating	Doc type
Manual:How_to_make_a_MediaWiki_skin	18297	35909	0.510	14	Very long	How-to and Tutorial
Manual:Developing_extensions	18393	32430	0.567	12	Long	How-to
Writing_an_extension_for_deployment	11742	26202	0.448	6	Long	How-to
API:Parsing_wikitext	4433	5183	0.855	11	Medium - long	API docs; code samples
API:Main_page	2632	4210	0.625	16	Medium - long	Landing page / overview

Manual:Developing_extensions and Manual:How to make a MediaWiki skin are both long how-to docs with far more characters and bytes than API:Main_page or API:Parsing_wikitext, yet all four pages are nearly equal in length when rendered in a browser or for printing.
- The differences here are due to transcluded content in fancy templates, and code formatting.
I'm surprised that the ratio of characters to bytes isn't more consistent. I expected variation, but not at the scale I saw. The standard deviation for the LibreOffice character count vs. page bytes ratio across my 20 docs was 0.324, with a mean of 0.648, a min of 0.363, and a max of 1.678.
How_to_become_a_MediaWiki_hacker (21,144 bytes) transcludes New_Developers/Communication_tips which is, by itself, 6,878 bytes and 709 words. None of that content is counted in the byte-size measurement for the page in which it is transcluded.

It seems that character count and byte size vary in their reliability as a measurement of page length based on:

Whether the page uses templates or transclusion
Whether the page uses code samples or other content that generates more white space than paragraph text

Page size in bytes varies by language

Not all languages have a one-to-one correlation between characters and bytes like English (generally) does. This has real implications for languages like Hebrew and Russian. See, for example, T275319: Change $wgMaxArticleSize limit from byte-based to character-based.
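The character-versus-byte gap is easy to see with UTF-8 encoding. A small sketch, where the sample words are arbitrary illustrations rather than text from any wiki page and the function name is hypothetical:

```python
def chars_and_bytes(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


english = chars_and_bytes("hello")   # (5, 5): one byte per character
russian = chars_and_bytes("привет")  # (6, 12): two bytes per Cyrillic letter
hebrew = chars_and_bytes("שלום")     # (4, 8): two bytes per Hebrew letter
```

A page written in Cyrillic or Hebrew script can therefore report roughly twice the byte size of an English page with the same number of characters, before any difference in content is considered.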

Most of our technical documentation is in English, so this issue isn't as relevant on sites like Wikitech. However, on mediawiki.org translations are part of our technical documentation, so this issue is relevant in that context.

The Ukrainian translation of Help:Magic words/hu is 118,694 bytes at 70% translated. The English version of the same page is 107,602 bytes.
XTools "Largest Pages" provides a useful way to compare the different sizes in page bytes across translations for a given page: https://xtools.wmcloud.org/largestpages/www.mediawiki.org/12?include_pattern=%25Magic_words%25 (keep in mind that this is inherently flawed as a measure due to inconsistent translation completion percentages across language versions).

The rendered HTML of a page will likely *also* vary by language, so it's not like this is a problem that only impacts byte-based measurements. Character-based measurements would be a better data point...but the only tools I found with that data were limited to measuring "prose".

Prose measurements leave out too much page content for assessing tech docs

In the sections below, I link to the prosesize tool because it displays what it did / didn't count as prose. In my calculations, I used the XTools prose measurements.

The prose measurement data sources generally exclude non-paragraph text when measuring prose content. At first, I thought this would be okay, since my goal is to identify which pages in a collection of technical documentation have "walls of text", and/or a structure that is so lengthy or complex that it's likely to impact developer experience. Even though wikipedia:Wikipedia:Prosesize excludes lists from its calculation of "what is prose", I thought that might be okay. Lists are a good way to add structure to content, and to break up walls of text. So, maybe a page length measurement that excludes lists could be acceptable, if my main goal is to find un-structured, large chunks of prose.

Example: Writing_an_extension_for_deployment uses lists to structure content, but (as a human reviewer) I still find this page to be overwhelming and text-heavy. So, maybe excluding lists is not actually a good idea for assessing readability or page UX in the tech docs context.
The more structured a page is, the less "prose bytes" may represent its real length. For example: Wikimedia_tutorials uses a layout grid, and prose tools discount all the text inside the content grid boxes. So, only 3.7% of the text on the page is captured.

Different tools and data sources use varying definitions for what is "text" on a page, and they may vary in how they parse the wikitext or HTML. As a result, different data sources report different numbers for character or word counts, depending on their parsing strategy, and on how they identify what counts as "text" to be measured. After reviewing the prose measurements from XTools and prosesize, I concluded their divergence is small enough to be ignored. (But I still recorded the raw numbers from each in my spreadsheet.)

Across the 20 docs I analyzed, prose tools captured on average 83% of what I considered to be prose chars. This is great! But, when I compare the content captured by prose tools vs. all the content on the page (including code samples), prose tools only captured, on average, 33% of the page content.

As expected, the capture rate was worst for docs that contain primarily reference tables, code samples, or layout elements:

Database_field_prefixes: 1.6% of content captured
API:Allmessages: 2.5% of content captured

For other doc types, prose tools captured more of the page content, but I started to see how much essential tech doc content we'd be missing if we exclude code elements from our measurements:

For Manual:Hooks, prose tools captured 95% of what I also considered to be prose. But that prose the tools captured, when compared to the full character count of the page, only represents 11% of the page content. In this case, only looking at prose would mean not measuring the majority of the page's text, because that text is in code samples or one very large table.
For API:Nearby_places_viewer: only 31% of page content is captured if code is excluded. That might be fine, but it doesn't help us assess whether our tutorials are too long, or if the code samples themselves are very wordy.

Code samples, while not "prose", are essential in multiple types of technical documentation^[7]. I think code samples should be included in how we assess the length and complexity of a page. Their formatting usually causes pages to be longer, even though the amount of content on each line is much less than that of prose. Lists, code samples, and other non-paragraph formats, while not technically prose, likely do still contribute to the length of a page and the cognitive burden of using it. I think this calls into question the utility of page length as a quality measurement for any type of tech doc that extensively uses code samples.

All the data

In this Google Spreadsheet

Conclusions and next steps

I'm really wondering: did I miss some tool or data source where I could easily get the number of characters or words of the rendered HTML page? Or even statistics for character and word count that are less restrictive than prose-based measurements?

For now, my conclusions are:

To assess the readability and complexity of tech docs, we need to measure more content than just what fits into a traditional definition of "prose". Prose measurements (as currently implemented in the tools I used) exclude too many pieces of content that are essential parts of technical documentation. (This is not to throw shade on those tools: they are awesome and were probably built for Wikipedia article assessment, which is a very different type of text).
We can't rely on page size in bytes to reflect the human experience of page content. Wikitext is too good at disguising lots of complexity in simple markup!
- Code samples and large amounts of template content matter for assessing page length and readability, so we can't reliably use page bytes as a proxy for measuring page length.
- We need a measure of page length that corresponds to the content as the human reader experiences it when viewing the page, which requires using the rendered HTML. OR we could just abandon trying to measure page length at all since there are so many variables impacting its utility.
We can only use page length as a doc quality metric for types of docs that don't have many code samples. So, probably not tutorials and definitely not reference docs. Since we don't have programmatically accessible page metadata about doc type, I'm not sure it's worth the effort to implement page length as one of our metrics inputs.
Measuring readability is hard.

Future work

Try using mwparserfromhtml, see if it aligns more closely with human experience of page length and content density.
Investigate section count as a way of understanding doc length? (didn't have time to dig into that data)
Research whether long code samples hinder readability in the same way that long paragraphs do. Is there a line limit at which we should instead just link to example files stored in source control, instead of putting code in wiki pages?
- Developers in Meng (2018) regarded code examples as more informative, and as something that "can be grasped faster than text" (p. 321). Meng also concludes, "Importantly, code examples also seem to serve a signaling function. They help developers identify relevant sections in the documentation. For example, when scanning a page, developers first check the code in order to verify that the page content actually relates to their current problem" (p. 322).

Ideally, I'd like to leave all of this calculation of doc quality based on content attributes up to an ML model, but that isn't yet feasible^[8]. At the very least, this deep dive has deepened my understanding of which content features would be relevant, if we were to design/train a content quality model specifically for assessing technical documentation.

Related resources

Quarry query to find pages on mediawiki.org by byte size
https://meta.wikimedia.org/wiki/Title_length
Page title size limitations
Wikipedia:Size comparisons - Wikipedia
More example code: see Question 4 in https://public-paws.wmcloud.org/User:Tizianopiccardi/ICWSM_tutorial-content.ipynb

Additional interesting examples

User:קיפודנחש/huge_test - This user did a test to find the limits of template expansion. The page size in bytes is ~135K but the page exceeds the MediaWiki software's limit for template include size.
Help:Magic_words - long page, lots of reference tables, hefty nav template at the bottom. Page size in bytes (107,602) is greater than, for example, Mediawiki-Vagrant (73,150).
Lua/Tutorial - long page, lots of code samples and tables. Page size 90,694 bytes.
Wikidata_Query_Service/User_Manual - long page, 61,917 bytes
Phabricator:Help - long page, with many videos too! 59,635 bytes
Template:Wikimedia_extension_database_tables is 12,791 bytes and a large chunk of content that is added to each of the pages of individual database table documentation. That means that the size of those pages will be under-represented by 12,791 bytes. For comparison, a page that is mostly text but a similar number of bytes: Content_translation/Machine_Translation/NLLB-200.

References

[1] https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language-agnostic_Wikipedia_article_quality
[2] Aliaksei Miniukovich, Antonella De Angeli, Simone Sulpizio, and Paola Venuti. 2017. Design Guidelines for Web Readability. In Proceedings of the 2017 Conference on Designing Interactive Systems (DIS '17). Association for Computing Machinery, New York, NY, USA, 285–296. https://doi.org/10.1145/3064663.3064711
[3] meta:Research:Data_introduction#Wikitext_doesn't_fully_represent_page_content:_use_HTML_instead
[4] XTools/Page History#Prose
[5] https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/blob/main/wikipedia_prosesize/src/lib.rs#L70
[6] https://en.wikipedia.org/wiki/Wikipedia:Prosesize
[7] Meng, M., Steinhardt, S., & Schubert, A. (2018). Application Programming Interface Documentation: What Do Software Developers Want? Journal of Technical Writing and Communication, 48(3), 295-330. https://doi.org/10.1177/0047281617721853
[8] Page length is a feature used to predict article or revision quality in many of our ML models. Ideally we could just use that instead of computing page length ourselves to then calculate tech doc metrics based on it, but page quality ML models are only intended for use on Wikipedias, not on technical wikis (so far). The criteria for quality in the tech docs context also differ from the encyclopedic context, so the models may not be applicable even if they were available for content on mediawiki.org.
