User:TBurmeister (WMF)/Measuring page length

From mediawiki.org

This page describes a learning journey and analysis project around the not-so-simple question: "How can I measure the amount of text on a wiki page?" It summarizes:

  • Available data sources for page size, page length, number of characters, words, sentences, and sections, focused on Wikimedia technical wikis
  • What data each source offers and how data varies across sources
  • Potential issues with using those available data sources for measuring the page length and content complexity of technical documentation. Highlights of where available data does or doesn't align with my human perception of page length
  • Analysis of which types of tech docs and elements of tech docs content are most likely to be impacted by these issues
  • Ideas for what to do about it!

Motivation and context

My goal is to assess the readability and usability of our technical documentation, as part of the Doc metrics project. I'm interested in the number of characters and words on a page, and in the structure of the page (number of sections), because those content attributes are a strong indicator of readability and quality[1]. Most standards for assessing web content usability use the number of characters of visible text on a page to assess the content[2], since this is what the user experiences.

In trying to define a scoring rubric for a "page length" metric (e.g. if Page_length < 20k bytes then 1, else 0), I realized I had a misconception: I assumed that "page size" in bytes would reflect the number of characters and the amount of content on the page, and so align with how a human reader would experience it. Even though I knew that pages are stored in wikitext and rendered as HTML in many varied ways[3], I hadn't considered that this makes page size in bytes an unreliable way to measure page length, or the amount of content on a page, for tech docs metrics.

Even when I had this realization, I still thought there must be a data source out there that could give me character or word counts for the rendered HTML. Thus began my quest.

Questions to answer

  • Which data sources provide data for page size or page length?
  • Do the available data sources align or diverge in their calculations of page size / page length?
  • How do different types of page content and formatting influence page size and page length measurements? Are these interactions consistent or unpredictable? Do they have particular impact for technical documentation, or for certain types of tech docs?
  • Do available data sources reflect how a human would assess the length of a page? Does the data (whether bytes, number of sections, number of characters, or whatever) actually correlate with whether a page is "text-heavy" or "very long"?

Data sources

I used the following data sources and tools in this analysis:

  • MediaWiki Page Information
  • XTools - specifically XTools/Page History
  • prosesize
  • Expresso - limited to plain text under 5000 words.
  • HTML page contents retrieved via https://www.mediawiki.org/api/rest_v1/#/Page%20content/get_page_html__title_, then copied and pasted as unformatted text into LibreOffice, to use the software's built-in Word Count function.
    • For LibreOffice-based "all chars" or "all words" data, I removed only the languages menu and navigation menu content. The character counts exclude spaces.
    • For LibreOffice-based "prose chars" or "prose words" data, I removed all code samples, any content that looked like code (e.g. example CSS or CLI commands), and reference tables that didn't contain full sentences or prose-like content. This was not an exact science. The character counts exclude spaces.
  • Printable version of the page, generated by clicking the MediaWiki "Printable version" link when logged in (so using the same skin/user preferences) and with the same browser settings each time.
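
The manual copy-and-paste step above could be approximated in code. The following is a minimal sketch, not the method used in this analysis: it assumes the REST path /page/html/{title} behind the documentation link above, uses only the Python standard library, and counts words by splitting on whitespace, so its numbers will not match LibreOffice exactly. The helper names (page_html_url, count_text) are made up for illustration, and nothing here removes navigation or language menus.

```python
from html.parser import HTMLParser
from urllib.parse import quote

# Assumed base URL for the REST endpoint that returns rendered page HTML.
REST_BASE = "https://www.mediawiki.org/api/rest_v1/page/html/"


def page_html_url(title: str) -> str:
    """Build the URL for a page's HTML (hypothetical helper)."""
    return REST_BASE + quote(title.replace(" ", "_"), safe="")


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_text(html: str) -> dict:
    """Count whitespace-separated words and non-space characters."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps text from adjacent blocks apart, at the
    # cost of splitting a word that is interrupted by an inline tag.
    words = " ".join(parser.chunks).split()
    return {"words": len(words), "chars_no_spaces": sum(len(w) for w in words)}


sample = "<p>Hello world</p><style>p{color:red}</style><pre>ls -la</pre>"
print(page_html_url("Manual:Hooks"))
print(count_text(sample))
```

Fetching the URL (for example with urllib.request) is left out so that the sketch runs without network access.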

I reviewed but did not use these additional data sources or calculation methods:

Data elements

Page bytes

This is the standard method of measuring page size in most MediaWiki built-in metrics tools and the corresponding database tables. Consequently, you can access page size in bytes through a variety of different tools and dashboards:

  • XTools - XTools/Page History contains both page size in bytes (the same as all above sources) and "prose size" in bytes (see below).
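
Page size in bytes can also be read programmatically. The sketch below assumes the MediaWiki Action API query module with prop=info, which reports each page's size in a length field; the helper names are hypothetical, and the sample response is hand-written for illustration (it reuses the Manual:Hooks byte count from the findings below) rather than captured from a live request.

```python
from urllib.parse import urlencode

# Assumed Action API endpoint for mediawiki.org.
API = "https://www.mediawiki.org/w/api.php"


def page_info_url(title: str) -> str:
    """Build a prop=info query URL for one title (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API + "?" + urlencode(params)


def page_bytes(response: dict) -> dict:
    """Map page title to size in bytes; missing pages have no length."""
    pages = response["query"]["pages"]
    return {page["title"]: page["length"] for page in pages if "length" in page}


# Hand-written sample in the formatversion=2 shape, not a real API capture.
sample_response = {
    "query": {
        "pages": [
            {"pageid": 1, "title": "Manual:Hooks", "length": 177849},
            {"title": "No such page", "missing": True},
        ]
    }
}

print(page_info_url("Manual:Hooks"))
print(page_bytes(sample_response))
```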

Prose measurements: bytes, characters, words

Prose measurements quantify only the subset of content on a page that a given tool considers to be "prose". Each tool I used has its own calculation, and their conclusions about "what is prose" vary slightly:

  • XTools - XTools/Page History displays "prose size" in bytes, along with the number of characters and words in the page sections the tool considers to be "prose".
  • prosesize:
    • Size (in bytes) of the text within readable prose sections on the page (calculated via the Rust String::len method).[5][6]
    • Based on Parsoid HTML for each wiki page.
    • "counts the text within <p> tags in the HTML source of the document, which corresponds almost exactly to the definition of "readable prose". This method is not perfect, however, and may include text which isn't prose, or exclude text which is (e.g. in {{cquote}}, or prose written in bullet-point form)."
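
To make the "text within <p> tags" definition concrete, here is a rough Python sketch. It is not the prosesize implementation (which is written in Rust and runs on Parsoid HTML); it assumes well-formed HTML in which every <p> has a closing tag, and it uses len(text.encode("utf-8")) as the Python counterpart of a Rust string's byte length.

```python
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collect the text of each <p> element; ignore everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._current = None  # buffer for the paragraph being read, if any
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.paragraphs.append("".join(self._current))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def prose_size(html: str) -> dict:
    """Bytes, characters, and words of the text inside <p> elements."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    return {
        "bytes": sum(len(p.encode("utf-8")) for p in parser.paragraphs),
        "chars": sum(len(p) for p in parser.paragraphs),
        "words": sum(len(p.split()) for p in parser.paragraphs),
    }


sample = (
    "<p>Привет, мир</p>"
    "<ul><li>skipped item</li></ul>"
    "<pre>code</pre>"
    "<p>Hi <b>there</b></p>"
)
print(prose_size(sample))
```

In the sample, the list item and the code block contribute nothing, which is the behavior that matters for technical documentation.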

I also calculated prose based on my own human analysis, because I wanted to capture "all the text on the page that isn't menus, code, or reference tables". Because this was a quick and dirty experiment, I did it manually using LibreOffice and my little human brain (see Data sources above for details).

All characters and words

These data points came from LibreOffice, generated as described in #Data_sources above. I was not able to find any easier-to-use data source that would give me a character and word count for all the text in all content sections of a rendered wiki page. Perhaps mwparserfromhtml would be the key to this, but I didn't have time to dig into that.

For samples under 5000 words, I pasted the plain text from LibreOffice into Expresso, which has a different counting algorithm, so its numbers vary slightly from those of LibreOffice.

Sections

As reported in XTools/Page History. Section count includes the lead section, so all pages will have at least one section.
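
As an illustration of counting the lead section plus every heading, a section count could be approximated from raw wikitext as follows. This is a sketch under stated assumptions, not how XTools computes its number: it only matches heading lines of the form == Title ==, and it does not account for headings produced by templates or headings inside nowiki or pre blocks.

```python
import re

# A heading line: 1 to 6 "=" signs, a title, then the same run of "=" signs.
HEADING = re.compile(r"^(={1,6})[^\S\n]*(.+?)[^\S\n]*\1[^\S\n]*$", re.MULTILINE)


def count_sections(wikitext: str) -> int:
    """Count heading lines and add one for the lead section."""
    return 1 + len(HEADING.findall(wikitext))


sample = (
    "Lead text.\n"
    "\n"
    "== Usage ==\n"
    "Some text.\n"
    "=== Details ===\n"
    "x = y is not a heading.\n"
)
print(count_sections(sample))
```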

Sentences

For samples under 5000 words, I pasted the plain text from LibreOffice into Expresso, which can calculate sentences and other more detailed text metrics.

Pages

Based on MediaWiki "printable version". See #Data_sources above. Admittedly a very subjective piece of data, but I needed some way to capture the relative differences in how much scrolling one would have to do in order to go through an entire page.

Methodology

I did this manually because I wanted to understand the nuances of what was going on, and make conclusions about the data by comparing it with my human perceptions of page content as I did the work.

  1. Selected 20 technical documentation pages of varying lengths and formats, from several different collections, and representing multiple doc types (reference, tutorial, landing page, etc.).
  2. For each page, I gathered the following data, if available, from each data source:
    • Bytes: size of the page as reported by MediaWiki via Page Information (or XTools/Page History, which uses the same data)
    • Prose measurements:
      • Prose bytes: size of the subset of page content that a given tool considers to be "prose"
      • Prose chars: number of characters in the subset of page content that a given tool considers to be "prose"
      • Prose words: word count in the subset of page content that a given tool considers to be "prose"
    • Chars with all page contents included: based on LibreOffice measurements of unformatted HTML page contents, with nav menus removed.
    • Words with all page contents included: based on LibreOffice measurements of unformatted HTML page contents, with nav menus removed.
    • Sentences: only accessible through Expresso, so limited to content under 5,000 words.
    • Sections: as reported by XTools/Page History.
    • Print pages: number of pages in the PDF document generated by clicking the MediaWiki "Printable version" link when viewing a page.
    • Human page length rating: a rough estimate based on opening the page, scrolling to the bottom, and giving a rating based on a scale of small - medium - large - very large - epic. I didn't attempt to align these values with any page size ranges; my goal was to record a gut reaction of how I perceived the page size.
    • Doc type: a rough assessment of the type of content on the page, based on standard document types for technical docs.
  3. Based on that data, I calculated the following:
    • Prose bytes vs. page bytes: A ratio calculated by dividing prose bytes by page bytes, using data from XTools/Page History. Captures what percentage of the bytes on a page are "prose".
    • Prose chars vs. LibreOffice prose chars: A ratio calculated by dividing prose chars value (from XTools/Page History) by the character count of prose as calculated by LibreOffice. Captures the difference between what tools like XTools and prosesize count as "prose" vs. what a human (me) considers to be prose.
    • Prose chars vs. LibreOffice all chars: A ratio calculated by dividing prose chars value (from XTools/Page History) by the character count of all text content rendered on the page, excluding menus, calculated by LibreOffice. Captures the difference between the "prose" and the actual amount of content rendered on the page.
    • LibreOffice all chars vs. page bytes: A ratio calculated by dividing the character count of all text content rendered on the page, excluding menus, calculated by LibreOffice, by the page size in bytes, as reported by MediaWiki via Page Information (or XTools/Page History, which uses the same data).
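
The four calculated ratios can be written down compactly. The sketch below is only an illustration of the arithmetic: the class name, field names, and the numbers for "Example page" are hypothetical and do not come from the spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class PageMeasurements:
    """Raw measurements for one page (field names are hypothetical)."""

    title: str
    page_bytes: int         # MediaWiki Page Information / XTools
    prose_bytes: int        # XTools "prose size"
    prose_chars: int        # XTools prose characters
    human_prose_chars: int  # LibreOffice count of manually selected prose
    all_chars: int          # LibreOffice count of all text, menus removed

    def ratios(self) -> dict:
        """Return the four ratios from step 3, rounded to 3 decimals."""
        return {
            "prose_bytes_vs_page_bytes": round(self.prose_bytes / self.page_bytes, 3),
            "prose_chars_vs_human_prose_chars": round(self.prose_chars / self.human_prose_chars, 3),
            "prose_chars_vs_all_chars": round(self.prose_chars / self.all_chars, 3),
            "all_chars_vs_page_bytes": round(self.all_chars / self.page_bytes, 3),
        }


# Made-up numbers, chosen only to keep the arithmetic easy to check.
example = PageMeasurements(
    title="Example page",
    page_bytes=10000,
    prose_bytes=2500,
    prose_chars=2400,
    human_prose_chars=3000,
    all_chars=6000,
)
print(example.ratios())
```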

Findings

You can view all my raw data and calculations in this Google Spreadsheet.

Page size in bytes does not reflect actual page length

| Page | LibreOffice char count | Bytes | LibreOffice all chars vs. page bytes | Print pages | Human page length rating | Doc type |
| --- | --- | --- | --- | --- | --- | --- |
| Manual:Hooks | 71139 | 177849 | 0.400 | 23 | Epic | How-to and reference |
| Manual:FAQ | 45260 | 89127 | 0.508 | 38 | Epic | FAQ |
| ResourceLoader/Migration_guide_(users) | 48430 | 79408 | 0.610 | 20 | Long | Reference |
| API:REST_API/Reference | 50816 | 65205 | 0.779 | 59 | Epic | API docs; Reference; code samples |

Pages that use code samples or tables, like reference pages, will naturally be longer to the human eye due to how code is formatted and distributed across lines. In general, it's acceptable for reference docs to be long, because we assume that users will employ a targeted search or skimming approach to quickly locate the information they need. So, I didn't want to compare reference content with non-reference content.

However, even when I inspected non-reference pages, I found surprises in how rendered page length correlates with page bytes:

| Page | LibreOffice char count | Bytes | LibreOffice all chars vs. page bytes | Print pages | Human page length rating | Doc type |
| --- | --- | --- | --- | --- | --- | --- |
| Manual:How_to_make_a_MediaWiki_skin | 18297 | 35909 | 0.510 | 14 | Very long | How-to and Tutorial |
| Manual:Developing_extensions | 18393 | 32430 | 0.567 | 12 | Long | How-to |
| Writing_an_extension_for_deployment | 11742 | 26202 | 0.448 | 6 | Long | How-to |
| API:Parsing_wikitext | 4433 | 5183 | 0.855 | 11 | Medium - long | API docs; code samples |
| API:Main_page | 2632 | 4210 | 0.625 | 16 | Medium - long | Landing page / overview |

  • Manual:Developing_extensions and Manual:How_to_make_a_MediaWiki_skin are both long how-to docs with far more characters and bytes than API:Main_page or API:Parsing_wikitext, yet the latter two pages are nearly as long as the how-to docs when rendered in a browser or for printing (11 and 16 print pages vs. 12 and 14).
    • The differences here are due to transcluded content in fancy templates, and code formatting.
  • I'm surprised that the ratio of characters to bytes isn't more consistent. I expected variation, but not at the scale I saw. The standard deviation for the LibreOffice character count vs. page bytes ratio across my 20 docs was 0.324, with a mean of 0.648, a min of 0.363, and a max of 1.678.
  • How_to_become_a_MediaWiki_hacker (21,144 bytes) transcludes New_Developers/Communication_tips which is, by itself, 6,878 bytes and 709 words. None of that content is counted in the byte-size measurement for the page in which it is transcluded.
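
The ratio and its spread can be recomputed from the rows shown in the two tables above. This sketch covers only those nine pages, so its summary statistics will not reproduce the figures quoted for the full set of 20 docs; the function names are illustrative.

```python
from statistics import mean, stdev

# (page, LibreOffice char count, page bytes, print pages),
# copied from the two tables above.
ROWS = [
    ("Manual:Hooks", 71139, 177849, 23),
    ("Manual:FAQ", 45260, 89127, 38),
    ("ResourceLoader/Migration_guide_(users)", 48430, 79408, 20),
    ("API:REST_API/Reference", 50816, 65205, 59),
    ("Manual:How_to_make_a_MediaWiki_skin", 18297, 35909, 14),
    ("Manual:Developing_extensions", 18393, 32430, 12),
    ("Writing_an_extension_for_deployment", 11742, 26202, 6),
    ("API:Parsing_wikitext", 4433, 5183, 11),
    ("API:Main_page", 2632, 4210, 16),
]


def char_byte_ratios(rows):
    """Map each page to its character count divided by its size in bytes."""
    return {page: chars / size for page, chars, size, _pages in rows}


def summarize(ratios):
    """Mean, sample standard deviation, minimum, and maximum of the ratios."""
    values = list(ratios.values())
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


ratios = char_byte_ratios(ROWS)
summary = summarize(ratios)

# Print pages from lowest to highest ratio, then the rounded summary.
for page, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{page}: {ratio:.3f}")
print({name: round(value, 3) for name, value in summary.items()})
```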

It seems that character count and byte size vary in their reliability as a measurement of page length based on:

  • Whether the page uses templates or transclusion
  • Whether the page uses code samples or other content that generates more white space than paragraph text

Page size in bytes varies by language

Not all languages have a one-to-one correlation between characters and bytes like English (generally) does. This has real implications for languages like Hebrew and Russian. See, for example, T275319 Change $wgMaxArticleSize limit from byte-based to character-based.

Most of our technical documentation is in English, so this issue isn't as relevant on sites like Wikitech. However, on mediawiki.org translations are part of our technical documentation, so this issue is relevant in that context.

The rendered HTML of a page will likely *also* vary by language, so it's not like this is a problem that only impacts byte-based measurements. Character-based measurements would be a better data point...but the only tools I found with that data were limited to measuring "prose".
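
The character-to-byte mismatch is easy to see with a few short strings. This sketch assumes UTF-8 encoding and uses Python's len(), which counts Unicode code points; the sample words are arbitrary greetings chosen for illustration.

```python
# Short sample words; the choice of words is arbitrary.
SAMPLES = {
    "English": "Hello",
    "Russian": "Привет",
    "Hebrew": "שלום",
}


def bytes_per_char(text: str) -> float:
    """UTF-8 byte length divided by the number of code points."""
    return len(text.encode("utf-8")) / len(text)


for language, word in SAMPLES.items():
    size = len(word.encode("utf-8"))
    print(f"{language}: {len(word)} chars, {size} bytes, "
          f"{bytes_per_char(word):.1f} bytes per char")
```

The English sample uses one byte per character, while the Russian and Hebrew samples use two, so a byte-based size threshold treats the same amount of visible text differently depending on the language.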

Prose measurements leave out too much page content for assessing tech docs

In the sections below, I link to the prosesize tool because it displays what it did / didn't count as prose. In my calculations, I used the XTools prose measurements.

The prose measurement data sources generally exclude non-paragraph text when measuring prose content. At first, I thought this would be okay, since my goal is to identify which pages in a collection of technical documentation have "walls of text", and/or a structure that is so lengthy or complex that it's likely to impact developer experience. Even though wikipedia:Wikipedia:Prosesize excludes lists from its calculation of "what is prose", I thought that might be okay. Lists are a good way to add structure to content, and to break up walls of text. So, maybe a page length measurement that excludes lists could be acceptable, if my main goal is to find un-structured, large chunks of prose.

  • Example: Writing_an_extension_for_deployment uses lists to structure content, but (as a human reviewer) I still find this page to be overwhelming and text-heavy. So, maybe excluding lists is not actually a good idea for assessing readability or page UX in the tech docs context.
  • The more structured a page is, the less "prose bytes" may represent its real length. For example: Wikimedia_tutorials uses a layout grid, and prose tools discount all the text inside the content grid boxes. So, only 3.7% of the text on the page is captured.

Different tools and data sources use varying definitions for what is "text" on a page, and they may vary in how they parse the wikitext or HTML. As a result, different data sources report different numbers for character or word counts, depending on their parsing strategy, and on how they identify what counts as "text" to be measured. After reviewing the prose measurements from XTools and prosesize, I concluded their divergence is small enough to be ignored. (But I still recorded the raw numbers from each in my spreadsheet.)

Across the 20 docs I analyzed, prose tools captured on average 83% of what I considered to be prose chars. This is great! But, when I compare the content captured by prose tools vs. all the content on the page (including code samples), prose tools only captured, on average, 33% of the page content.

As expected, the capture rate was worst for docs that contain primarily reference tables, code samples, or layout elements:

For other doc types, prose tools captured more of the page content, but I started to see how much essential tech doc content we'd be missing if we exclude code elements from our measurements:

  • For Manual:Hooks, prose tools captured 95% of what I also considered to be prose. But that prose the tools captured, when compared to the full character count of the page, only represents 11% of the page content. In this case, only looking at prose would mean not measuring the majority of the page's text, because that text is in code samples or one very large table.
  • For API:Nearby_places_viewer: only 31% of page content is captured if code is excluded. That might be fine, but it doesn't help us assess whether our tutorials are too long, or if the code samples themselves are very wordy.

Code samples, while not "prose", are essential in multiple types of technical documentation[7]. I think code samples should be included in how we assess the length and complexity of a page. Their formatting usually causes pages to be longer, even though the amount of content on each line is much less than that of prose. Lists, code samples, and other non-paragraph formats, while not technically prose, likely do still contribute to the length of a page and the cognitive burden of using it. I think this calls into question the utility of page length as a quality measurement for any type of tech doc that extensively uses code samples.

All the data

In this Google Spreadsheet

Conclusions and next steps

I'm really wondering: did I miss some tool or data source where I could easily get the number of characters or words of the rendered HTML page? Or even statistics for character and word count that are less restrictive than prose-based measurements?

For now, my conclusions are:

  • To assess the readability and complexity of tech docs, we need to measure more content than just what fits into a traditional definition of "prose". Prose measurements (as currently implemented in the tools I used) exclude too many pieces of content that are essential parts of technical documentation. (This is not to throw shade on those tools: they are awesome and were probably built for Wikipedia article assessment, which is a very different type of text.)
  • We can't rely on page size in bytes to reflect the human experience of page content. Wikitext is too good at disguising lots of complexity in simple markup!
    • Code samples and large amounts of template content matter for assessing page length and readability, so we can't reliably use page bytes as a proxy for measuring page length.
    • We need a measure of page length that corresponds to the content as the human reader experiences it when viewing the page, which requires using the rendered HTML. OR we could just abandon trying to measure page length at all since there are so many variables impacting its utility.
  • We can only use page length as a doc quality metric for types of docs that don't have many code samples. So, probably not tutorials and definitely not reference docs. Since we don't have programmatically accessible page metadata about doc type, I'm not sure it's worth the effort to implement page length as one of our metrics inputs.
  • Measuring readability is hard.

Future work

  • Try using mwparserfromhtml, see if it aligns more closely with human experience of page length and content density.
  • Investigate section count as a way of understanding doc length? (didn't have time to dig into that data)
  • Research whether long code samples hinder readability in the same way that long paragraphs do. Is there a line limit at which we should instead just link to example files stored in source control, instead of putting code in wiki pages?
    • Developers in Meng (2018) regarded code examples as more informative and said they "can be grasped faster than text" (p. 321). Meng also concludes, "Importantly, code examples also seem to serve a signaling function. They help developers identify relevant sections in the documentation. For example, when scanning a page, developers first check the code in order to verify that the page content actually relates to their current problem" (p. 322).

Ideally, I'd like to leave all of this calculation of doc quality based on content attributes up to an ML model, but that isn't yet feasible[8]. At the very least, this deep dive has deepened my understanding of which content features would be relevant, if we were to design/train a content quality model specifically for assessing technical documentation.

Additional interesting examples

References

  1. https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language-agnostic_Wikipedia_article_quality
  2. Aliaksei Miniukovich, Antonella De Angeli, Simone Sulpizio, and Paola Venuti. 2017. Design Guidelines for Web Readability. In Proceedings of the 2017 Conference on Designing Interactive Systems (DIS '17). Association for Computing Machinery, New York, NY, USA, 285–296. https://doi.org/10.1145/3064663.3064711
  3. meta:Research:Data_introduction#Wikitext_doesn't_fully_represent_page_content:_use_HTML_instead
  4. XTools/Page History#Prose
  5. https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/blob/main/wikipedia_prosesize/src/lib.rs#L70
  6. https://en.wikipedia.org/wiki/Wikipedia:Prosesize
  7. Meng, M., Steinhardt, S., & Schubert, A. (2018). Application Programming Interface Documentation: What Do Software Developers Want? Journal of Technical Writing and Communication, 48(3), 295-330. https://doi.org/10.1177/0047281617721853
  8. Page length is a feature used to predict article or revision quality in many of our ML models. Ideally we could just use that instead of computing page length ourselves to then calculate tech doc metrics based on it, but page quality ML models are only intended for use on Wikipedias, not on technical wikis (so far). The criteria for quality in the tech docs context also differ from the encyclopedic context, so the models may not be applicable even if they were available for content on mediawiki.org.