Page Previews/API Specification
This article outlines the specification for a new Node.js-based API that generates summaries for MediaWiki-based wikis and replaces the existing TextExtracts API.
Background & Motivation
Up until now, we've mostly gotten away with using the prop=extracts MediaWiki API provided by TextExtracts, together with RESTBase, to scale out Page Previews to a couple of large Wikipedias without issue.
However, the requirement that certain classes of pages should be handled differently means that TextExtracts is no longer the most appropriate place to house the notion of what a page preview is. We should aim to keep TextExtracts as simple and as general as possible. It may be that we compose the prop=extracts API and the new Page Preview API rather than integrating them, but this is not a goal of this work.
To be clear, the primary goal of this work is to minimise the amount of text/HTML processing in the Page Previews client: the less work the client has to do to display a preview, the better.
The specification
Intros
The API returns well-formed HTML representing the introductory elements of a page, which are defined as follows:
- The first paragraph from the introductory section.
- The first ordered, unordered, or definition list that is the next sibling of the first paragraph.
Herein we'll refer to these elements as an "intro".
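The definition above can be sketched as follows. This is an illustration in Python, not the service's Node.js implementation. It assumes the lead section is available as a well-formed fragment wrapped in a single root element; the function name extract_intro is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Ordered, unordered, and definition lists.
LIST_TAGS = {"ul", "ol", "dl"}


def _serialize(element):
    """Serialize an element without the text that trails it in its parent."""
    clone = copy.copy(element)
    clone.tail = None
    return ET.tostring(clone, encoding="unicode", method="html")


def extract_intro(lead_section_html):
    """Return the first paragraph plus the list that directly follows it, if any."""
    root = ET.fromstring(lead_section_html)
    children = list(root)
    for index, child in enumerate(children):
        if child.tag == "p":
            parts = [child]
            # Only the next sibling counts; a list further down is ignored.
            if index + 1 < len(children) and children[index + 1].tag in LIST_TAGS:
                parts.append(children[index + 1])
            return "".join(_serialize(part) for part in parts)
    # No paragraph in the lead section: there is no intro.
    return ""
```

Here "next sibling" is read as the next element sibling, so whitespace between the paragraph and the list is ignored. That reading is an assumption.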
Plaintext intros
Certain clients will not be able to handle HTML intros yet, e.g. the Wikipedia apps. To maintain compatibility with these clients, the API will also return a plaintext representation of the introductory elements of a page.
https://gerrit.wikimedia.org/r/370694
Empty intros
After the HTML intro has been processed (see below), it may contain no text content yet still contain HTML, e.g. <p><b></b></p>. Any processed intro that doesn't contain text content must be considered empty.
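A minimal sketch of the emptiness check in Python, using only the standard library. The name is_empty_intro is hypothetical, and treating whitespace-only text as "no text content" is an assumption that the specification does not state.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def is_empty_intro(intro_html):
    """Return True if the processed intro has no text content."""
    collector = _TextCollector()
    collector.feed(intro_html)
    collector.close()
    # Assumption: an intro containing only whitespace counts as empty.
    return "".join(collector.chunks).strip() == ""
```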
Markup allowed in an intro
By default, the Page Preview API (herein "the API") must remove any tag that doesn't fall into one of the following cases.
Emphasis
The API must retain any bolded or italicised text in the intro, i.e. it must not remove b, i, and em tags.
Formulae/MathML
In order to support browsers that don't support MathML, the API:
- Must remove math tags; and
- Must not remove either the inline or block layout fallback images generated by Math while parsing the page.
Super- and subscript
The API must retain all sup and sub tags except those generated by Cite, i.e. <sup class="reference"> elements.
Stripping of parenthetical statements
The API must remove all content enclosed within balanced parentheses. Parentheses are defined as the following characters: () and （）
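One way to implement this rule is a single pass with a stack, sketched below in Python. The sketch works on plain text; applying it to HTML would also need to skip over tags and attribute values. It assumes the parentheses themselves are removed along with their content and that unbalanced parentheses are left untouched. The name strip_parentheticals is hypothetical.

```python
# Opening parenthesis -> matching closing parenthesis (ASCII and fullwidth).
OPENERS = {"(": ")", "\uff08": "\uff09"}


def strip_parentheticals(text):
    """Remove every balanced parenthetical, including nested ones."""
    kept = []   # characters kept so far
    stack = []  # (expected closer, index in `kept` where the opener sits)
    for char in text:
        if char in OPENERS:
            stack.append((OPENERS[char], len(kept)))
            kept.append(char)
        elif stack and char == stack[-1][0]:
            # Balanced pair found: drop the opener and everything after it.
            _, start = stack.pop()
            del kept[start:]
        else:
            kept.append(char)
    # Unbalanced openers or closers were never deleted, so they survive.
    return "".join(kept)
```

The sketch does not tidy the whitespace left behind, so "Foo (bar) baz" keeps both of the spaces that surrounded the parenthetical.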
Flattening inline elements
The API must replace all span and a tags with their text content, e.g. <span>Foo</span> should be flattened to Foo and <a href="/foo">Foo</a> should be flattened to Foo.
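Flattening can be sketched as a streaming pass that drops the span and a tags and copies everything else through unchanged. The Python below uses the standard library's html.parser; the name flatten_inline is hypothetical.

```python
from html.parser import HTMLParser

FLATTENED_TAGS = {"span", "a"}


class _Flattener(HTMLParser):
    """Re-emits the input, omitting the start and end tags of flattened elements."""

    def __init__(self):
        # Keep character references as written so the text passes through verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in FLATTENED_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as written.
        if tag not in FLATTENED_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in FLATTENED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def flatten_inline(intro_html):
    """Replace span and a elements with their content."""
    flattener = _Flattener()
    flattener.feed(intro_html)
    flattener.close()
    return "".join(flattener.out)
```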
noexcerpt
The API must remove any element with the noexcerpt class to replicate the current behaviour of TextExtracts.
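A sketch of this removal in Python, assuming the intro is a well-formed fragment. The text that follows a removed element is kept, since only the element itself carries the class. The name remove_noexcerpt is hypothetical.

```python
import xml.etree.ElementTree as ET


def _prune(parent):
    """Remove noexcerpt descendants of parent, preserving the text after them."""
    previous = None  # last sibling that was kept
    for child in list(parent):
        if "noexcerpt" in child.get("class", "").split():
            tail = child.tail or ""
            # Re-attach the trailing text to whatever now precedes it.
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            _prune(child)
            previous = child


def remove_noexcerpt(intro_html):
    """Return the intro with every element of class noexcerpt removed."""
    # Wrap in a temporary root so fragments with several top-level nodes parse.
    root = ET.fromstring(f"<div>{intro_html}</div>")
    _prune(root)
    serialized = ET.tostring(root, encoding="unicode", method="html")
    return serialized[len("<div>"):-len("</div>")]
```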
Line breaks
It is assumed that any line breaks in the summary are necessary for the display of the content. We thus do not remove any instance of a line break that appears in the lead paragraph of a summary.
Request
Parameters
Name | Type | Description |
---|---|---|
title | String | The title of the page to get the intro for. |
Responses
A successful response from the Page Preview API, like those of all existing endpoints, must have the following properties:
Name | Type | Description |
---|---|---|
titles | Titles | The various titles of the page. |
lang | String | The 2- or 3-character ISO 639-1/ISO 639-3 code of the language of the intro. This should be the site content language or the page content language. |
dir | Enum | The direction of the script used to render the language of the intro. One of "ltr" or "rtl". |
last_modified | String | The time at which the page was last modified in ISO 8601 format. |
thumbnail | ?Image | The thumbnail of the image associated with the page. The thumbnail's largest side must not exceed 320px. By default, this property should not be present. |
original | ?Image | The original of the image associated with the page as determined by PageImages. By default, this property should not be present. |
wikidata_description | ?String | The description of the Wikidata item. |
The new summary endpoint will hydrate these properties with the additional fields specific to summaries:
Name | Type | Description |
---|---|---|
type | Enum | The notional type of the intro. One of "disambiguation", "wikidata", or "standard". |
intro | String | The intro of the page represented as well-formed HTML5. |
plaintext_intro | String | The intro of the page represented as plaintext. This property supersedes the extract property of the current RESTBase Page Summary endpoint. |
disambiguation_links | ?Titles[] | The titles of the first N links from the disambiguation page. By default, this property should not be present. |
Done
Where an Image type property must have the following properties:
Name | Type | Description |
---|---|---|
source | String | The URL of the image. |
width | Integer | The width of the image in px. |
height | Integer | The height of the image in px. |
And a Titles type property must have the following properties:
Name | Type | Description |
---|---|---|
denormalized | String | The title of the page, e.g. File:Igorrr_(band). |
normalized | String | The normalized title of the page, e.g. Igorrr (band).jpg |
display | String | The editor-formatted title of the page (see https://www.mediawiki.org/wiki/Help:Magic_words#Displaytitle), e.g. <strong>Igorrr (band).jpg</strong>. |
namespace_id | Integer | The ID of the namespace that the page is in on the wiki. |
namespace_name | String | The localized name of the namespace, e.g. User, Usuario, etc. |
page_id | Integer | The internal ID of the page. |
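The tables above can be written down as types. The Python sketch below treats properties marked with "?" as optional and all others as required. The class names, the missing_required helper, and every value in the example are hypothetical. The type property is typed as a plain string because later sections use values beyond the three listed in its table row.

```python
from typing import List, Literal, TypedDict


class Image(TypedDict):
    source: str  # URL of the image
    width: int   # px
    height: int  # px


class Titles(TypedDict):
    denormalized: str
    normalized: str
    display: str
    namespace_id: int
    namespace_name: str
    page_id: int


class _RequiredSummary(TypedDict):
    titles: Titles
    lang: str
    dir: Literal["ltr", "rtl"]
    last_modified: str  # ISO 8601
    type: str
    intro: str
    plaintext_intro: str


class Summary(_RequiredSummary, total=False):
    # Optional ("?") properties; absent by default.
    thumbnail: Image
    original: Image
    wikidata_description: str
    disambiguation_links: List[Titles]


def missing_required(summary):
    """Return the sorted names of required properties absent from summary."""
    return sorted(Summary.__required_keys__ - set(summary))


# Hypothetical values, for illustration only.
example_titles: Titles = {
    "denormalized": "Example_page",
    "normalized": "Example page",
    "display": "Example page",
    "namespace_id": 0,
    "namespace_name": "",
    "page_id": 1,
}
example_summary: Summary = {
    "titles": example_titles,
    "lang": "en",
    "dir": "ltr",
    "last_modified": "2017-10-01T12:00:00Z",
    "type": "standard",
    "intro": "<p>An <b>example</b>.</p>",
    "plaintext_intro": "An example.",
}
```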
For a page in the wiki's content namespace(s)
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "standard".
If the page has a corresponding Wikidata item, then the wikidata_description property must be set to the item's description.
For a page outside of the wiki's content namespaces
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "no-extract".
The extract and extract_html properties must be set to "".
For a page that doesn't use the wikitext, wikibase-item, or wikibase-property content model
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "no-extract".
The extract and extract_html properties must be set to "".
For a disambiguation page
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "disambiguation".
The disambiguation_links property of the response must be set to the first N links from the disambiguation page.
The intro property of the response should be set to the intro of the page so that the client may display it if appropriate.
Blocked
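Collecting the first N links from a disambiguation page could look like the Python sketch below. The specification does not fix N, so it is a parameter here. The sketch assumes internal links carry a title attribute and returns only those title strings rather than full Titles objects. The name first_n_link_titles is hypothetical.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects the title attribute of a elements, up to a limit."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a" or len(self.titles) >= self.limit:
            return
        # Assumption: links to wiki pages expose the target in `title`.
        title = dict(attrs).get("title")
        if title:
            self.titles.append(title)


def first_n_link_titles(page_html, n):
    """Return the titles of the first n links, in document order."""
    collector = _LinkCollector(n)
    collector.feed(page_html)
    collector.close()
    return collector.titles
```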
For a page that doesn't exist
The Page Preview API must respond with 404 Not Found.
The response body must be empty.
For a page that doesn't have a lead section
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "standard".
The intro property of the response must be set to "".
Examples
For a page that has an empty intro
The response must be the same as the "For a page that doesn't have a lead section" case.
For a page that redirects to another page
The Page Preview API must respond with 302 Found.
The Location HTTP header must be set to the URL that will get the intro for the target page.
Note: RESTBase handles redirects transparently to the underlying service (see T176517#3634838).
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "no-extract".
The extract and extract_html properties must be set to "".
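The response cases in this section can be read as one decision function, sketched below in Python. The specification does not say which case wins when several apply, so the order of the checks is an assumption, as is following the 302 rule for redirects. The Page and Response classes and the respond function are hypothetical, and disambiguation_links and the shared properties are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

SUPPORTED_CONTENT_MODELS = {"wikitext", "wikibase-item", "wikibase-property"}


@dataclass
class Page:
    """Hypothetical view of the facts the service knows about a page."""
    exists: bool = True
    redirect_target_url: Optional[str] = None
    in_content_namespace: bool = True
    content_model: str = "wikitext"
    is_disambiguation: bool = False
    intro: str = ""  # "" when the lead section is missing or the intro is empty


@dataclass
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: Optional[dict] = None  # None stands for an empty response body


def respond(page):
    """Pick the status, headers, and type for a page (check order is assumed)."""
    if not page.exists:
        return Response(404)
    if page.redirect_target_url is not None:
        return Response(302, {"Location": page.redirect_target_url})
    if (not page.in_content_namespace
            or page.content_model not in SUPPORTED_CONTENT_MODELS):
        return Response(200, body={"type": "no-extract", "extract": "", "extract_html": ""})
    if page.is_disambiguation:
        return Response(200, body={"type": "disambiguation", "intro": page.intro})
    return Response(200, body={"type": "standard", "intro": page.intro})
```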
Responses for Wikidata (from T111231: Page previews for Wikidata)
For a Wikidata item
This overrides the "For a page in the wiki's content namespace" case above.
The type property must be set to "wikidata_preview".
All members of the titles property object must be set to their equivalent of the item's label.
The extract property must be set to the item's description.
If the item has the image property set (to I):
- The image property must be set to the Image object that represents the Wikimedia Commons file referenced by I.
- The thumbnail property must be set to the Image object that represents the corresponding thumbnail.
Notes
The item's description should be in the user's language. If the description isn't available in the user's language, then the API must follow the language fallback chain until one is available.
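The fallback rule amounts to trying the user's language first and then each language in the chain, in order. A minimal Python sketch follows. The function name, the shape of the descriptions mapping, and the chain used in the usage example are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def pick_description(
    descriptions: Dict[str, str],
    user_lang: str,
    fallback_chain: Sequence[str],
) -> Optional[str]:
    """Return the first available description, or None if there is none."""
    for lang in [user_lang, *fallback_chain]:
        description = descriptions.get(lang)
        if description:  # skip missing and empty descriptions alike
            return description
    return None


# Usage with a made-up chain: no "gsw" description exists, so "de" is used.
descriptions = {"de": "Beispiel", "en": "example"}
chosen = pick_description(descriptions, "gsw", ["de", "en"])
```

A None result corresponds to the case of an item with no description.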
For a Wikidata item with no description
The response should be the same as the "For a Wikidata item" case apart from the following:
The extract and extract_html properties of the response must be set to "".