Page Previews/API Specification

For documentation on the completed API, see Page Content Service and the live API spec.

This article outlines the specification for a new Node.js-based API that generates summaries for MediaWiki-based wikis and replaces the existing TextExtracts API.

Background & Motivation

Until now, we've mostly gotten away with using the prop=extracts MediaWiki API provided by TextExtracts, along with RESTBase, to scale Page Previews out to a couple of large Wikipedias without issue.

However, the requirement that certain classes of pages be handled differently means that TextExtracts is no longer the most appropriate place to house the notion of what a page preview is. We should aim to keep TextExtracts as simple and as general as possible. It may be that we compose the prop=extracts API and the new Page Preview API rather than integrating them, but this is not a goal of this work.

To be clear, the primary goal of this work is to minimise the amount of text/HTML processing in the Page Previews client: the less work the client has to do to display a preview, the better.

The specification

Intros

The API returns well-formed HTML representing the introductory elements of a page, which are defined as follows:

The first paragraph from the introductory section.
The first ordered, unordered, or definition list that is the next sibling of the first paragraph.

Herein we'll refer to these elements as an "intro".
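The definition above can be sketched as a small selection function. This is a minimal illustration, not the service's implementation: the LeadElement shape and the selectIntroElements name are hypothetical, and it assumes the lead section has already been parsed into a flat list of its top-level elements.

```typescript
// Hypothetical, simplified view of one top-level element of the lead section.
interface LeadElement {
  tag: string;  // lower-case tag name, e.g. "p", "ul", "table"
  html: string; // serialized HTML of the element
}

const LIST_TAGS: string[] = ["ol", "ul", "dl"];

// Returns the first paragraph and, if its next sibling is a list, that list.
function selectIntroElements(lead: LeadElement[]): LeadElement[] {
  for (let i = 0; i < lead.length; i++) {
    const el = lead[i];
    if (el.tag !== "p") {
      continue; // skip anything before the first paragraph
    }
    const intro: LeadElement[] = [el];
    const next = i + 1 < lead.length ? lead[i + 1] : undefined;
    if (next !== undefined && LIST_TAGS.indexOf(next.tag) !== -1) {
      intro.push(next);
    }
    return intro;
  }
  return []; // no paragraph in the lead section, so no intro
}
```

A list is only included when it directly follows the first paragraph; a list that appears after a second paragraph is not part of the intro.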

Plaintext intros

Certain clients will not be able to handle HTML intros yet, e.g. the Wikipedia apps. To maintain compatibility with these clients, the API will also return a plaintext representation of the introductory elements of a page.

https://gerrit.wikimedia.org/r/370694
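One way to derive a plaintext representation from an HTML intro is sketched below. The htmlToPlaintext name is hypothetical, the regular expression assumes well-formed markup without ">" inside attribute values, and only a handful of character references are decoded; a real implementation would likely use an HTML parser.

```typescript
// Sketch only: strips tags and decodes a few common character references.
function htmlToPlaintext(html: string): string {
  return html
    .replace(/<[^>]*>/g, "") // drop every tag, keep the text content
    .replace(/&nbsp;/g, "\u00a0")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&") // last, so "&amp;lt;" becomes "&lt;" and not "<"
    .trim();
}
```

For example, htmlToPlaintext("<p><b>Foo</b> is a &amp; b.</p>") returns "Foo is a & b.".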

Empty intros

After the HTML intro has been processed (see below), it may still contain HTML but no text content, e.g. an empty paragraph element. Any processed intro that doesn't contain text content must be considered empty.

Implemented
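The emptiness check can be sketched as follows. The isEmptyIntro name is hypothetical, and treating whitespace-only text (including non-breaking spaces) as empty is an assumption that goes slightly beyond the wording above.

```typescript
// Sketch only: an intro is empty when nothing but markup and whitespace remains.
function isEmptyIntro(processedHtml: string): boolean {
  const text = processedHtml
    .replace(/<[^>]*>/g, "") // remove tags
    .replace(/&nbsp;/g, " "); // assumption: a non-breaking space is not text content
  return text.trim().length === 0;
}
```

Under these assumptions, isEmptyIntro("<p></p>") is true and isEmptyIntro("<p>Foo</p>") is false.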

Markup allowed in an intro

By default, the Page Preview API (herein "the API") must remove any tag that doesn't fall into one of the following cases.

Emphasis

The API must retain any bolded or italicised text in the intro, i.e. the Page Preview API must not remove b, i, and em tags.

Implemented

Formulae/MathML

In order to support browsers that don't support MathML, the API:

Must remove math tags; and
Must not remove either the inline or block layout fallback images generated by Math while parsing the page.

Implemented

Super- and subscript

The API must retain all sup and sub tags except those generated by Cite, i.e. footnote reference markers.

Implemented
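The emphasis, MathML, and super-/subscript rules can be read as a per-element decision, sketched below. This is an illustration under assumptions: the TagInfo shape and decide function are hypothetical, and the class names used to recognise Cite markers ("reference") and Math fallback images ("mwe-math-fallback-image-inline", "mwe-math-fallback-image-display") are assumed rather than taken from this specification. The sketch covers only these three rules; structural elements of the intro and the flattening and noexcerpt rules are handled elsewhere.

```typescript
// Hypothetical, minimal description of an element.
interface TagInfo {
  tag: string;       // lower-case tag name
  classes: string[]; // values of the class attribute
}

type Action = "keep" | "remove";

const EMPHASIS_TAGS: string[] = ["b", "i", "em"];
// Assumed class names; not defined by this specification.
const CITE_REFERENCE_CLASS = "reference";
const MATH_FALLBACK_CLASSES: string[] = [
  "mwe-math-fallback-image-inline",
  "mwe-math-fallback-image-display",
];

function hasAnyClass(el: TagInfo, names: string[]): boolean {
  for (const name of names) {
    if (el.classes.indexOf(name) !== -1) {
      return true;
    }
  }
  return false;
}

function decide(el: TagInfo): Action {
  if (EMPHASIS_TAGS.indexOf(el.tag) !== -1) {
    return "keep"; // bold and italic text is retained
  }
  if (el.tag === "sup" || el.tag === "sub") {
    // Retained unless it is a footnote marker generated by Cite.
    return hasAnyClass(el, [CITE_REFERENCE_CLASS]) ? "remove" : "keep";
  }
  if (el.tag === "img" && hasAnyClass(el, MATH_FALLBACK_CLASSES)) {
    return "keep"; // fallback image for browsers without MathML
  }
  return "remove"; // default, which also covers math tags
}
```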

Stripping of parenthetical statements

The API must remove all content enclosed within balanced parentheses. Parentheses are defined as the following characters: the ASCII pair () and the fullwidth pair （）

Implemented
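A stack-based sketch of the balanced-parentheses rule follows. It is illustrative only: the stripParentheticals name is hypothetical, it operates on plain text rather than HTML, it lets an ASCII opener pair with a fullwidth closer (and vice versa), and the whitespace tidy-up at the end is an assumption that the rule above does not mention. Unbalanced parentheses are left in place.

```typescript
const OPENERS = "(\uff08"; // "(" and the fullwidth "（"
const CLOSERS = ")\uff09"; // ")" and the fullwidth "）"

// Sketch only: removes every balanced parenthetical from plain text.
function stripParentheticals(text: string): string {
  const remove: boolean[] = [];
  for (let i = 0; i < text.length; i++) {
    remove.push(false);
  }
  const stack: number[] = []; // indexes of openers that are still unmatched
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (OPENERS.indexOf(ch) !== -1) {
      stack.push(i);
    } else if (CLOSERS.indexOf(ch) !== -1 && stack.length > 0) {
      const start = stack.pop() as number;
      for (let j = start; j <= i; j++) {
        remove[j] = true; // mark the whole balanced span, parentheses included
      }
    }
  }
  let out = "";
  for (let i = 0; i < text.length; i++) {
    if (!remove[i]) {
      out += text.charAt(i);
    }
  }
  // Assumed tidy-up: collapse doubled spaces and spaces left before punctuation.
  return out.replace(/ {2,}/g, " ").replace(/ +([,.;:])/g, "$1").trim();
}
```

For example, stripParentheticals("Foo (bar (baz)) qux.") returns "Foo qux.", while the unbalanced input "a) b (c" is returned unchanged.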

Flattening inline elements

The API must replace all span and a tags with their text content, e.g. <span>Foo</span> should be flattened to Foo and <a href="/foo">Foo</a> should be flattened to Foo.

Implemented
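Flattening can be approximated by dropping the opening and closing span and a tags while leaving their contents alone. The sketch below uses a regular expression and a hypothetical flattenInline name; it assumes well-formed markup with no ">" inside attribute values, so a real implementation would more likely work on a parsed DOM.

```typescript
// Matches opening and closing <span> and <a> tags, with or without attributes.
const FLATTENED_TAG = /<\/?(?:span|a)(?:\s[^>]*)?>/gi;

// Sketch only: removes the tags and keeps whatever was inside them.
function flattenInline(html: string): string {
  return html.replace(FLATTENED_TAG, "");
}
```

For example, flattenInline('<p><b><a href="/foo">Foo</a></b></p>') returns "<p><b>Foo</b></p>". Tags that merely start with the letter a, such as abbr, are not matched.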

`noexcerpt`

The API must remove any element with the noexcerpt class to replicate the current behaviour of TextExtracts.

Implemented
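The noexcerpt rule can be sketched as a recursive filter over a parsed tree. The ElementNode and HtmlNode types and the removeNoExcerpt name are hypothetical stand-ins for whatever DOM representation the service uses; the sketch checks descendants only and does not test the root element itself.

```typescript
// Hypothetical, minimal tree representation of parsed HTML.
interface ElementNode {
  tag: string;
  classes: string[];
  children: HtmlNode[];
}
type HtmlNode = ElementNode | string; // a string stands for a text node

// Sketch only: returns a copy of the tree without any noexcerpt descendants.
function removeNoExcerpt(node: ElementNode): ElementNode {
  const kept: HtmlNode[] = [];
  for (const child of node.children) {
    if (typeof child === "string") {
      kept.push(child);
    } else if (child.classes.indexOf("noexcerpt") === -1) {
      kept.push(removeNoExcerpt(child)); // keep the element, filter its subtree
    }
    // Elements carrying the noexcerpt class are dropped with all their content.
  }
  return { tag: node.tag, classes: node.classes, children: kept };
}
```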

Line breaks

It is assumed that any line breaks in the summary are necessary for the display of the content. We thus do not remove any instance of a line break that appears in the lead paragraph of a summary.

Request

Parameters

Name	Type	Description
title	String	The title of the page to get the intro for.

Implemented

Responses

A successful response from the Page Preview API, like those of all existing endpoints, must have the following properties:

Name	Type	Description
titles	Titles	The various titles of the page.
lang	String	The two- or three-character language code (ISO 639-1 or ISO 639-3) of the intro. This should be the site content language or the page content language.
dir	Enum	The direction of the script used to render the language of the intro. One of "ltr" or "rtl".
last_modified	String	The time at which the page was last modified in ISO 8601 format.
thumbnail	?Image	The thumbnail of the image associated with the page. The thumbnail's largest side must not exceed 320px. By default, this property should not be present.
original	?Image	The original of the image associated with the page as determined by PageImages. By default, this property should not be present.
wikidata_description	?String	The description of the Wikidata item.
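The 320px limit on the thumbnail's largest side can be illustrated with a small scaling helper. This is a sketch under assumptions: the Size type and fitThumbnail name are hypothetical, the aspect ratio is preserved, dimensions are rounded to whole pixels, and images that already fit are not enlarged.

```typescript
interface Size {
  width: number;
  height: number;
}

const MAX_THUMBNAIL_SIDE = 320;

// Sketch only: scales a size down so that its largest side is at most maxSide.
function fitThumbnail(original: Size, maxSide: number = MAX_THUMBNAIL_SIDE): Size {
  const largest = Math.max(original.width, original.height);
  if (largest <= maxSide) {
    return { width: original.width, height: original.height }; // already fits
  }
  const scale = maxSide / largest;
  return {
    width: Math.max(1, Math.round(original.width * scale)),
    height: Math.max(1, Math.round(original.height * scale)),
  };
}
```

For example, a 640 by 480 original yields a 320 by 240 thumbnail, and a 200 by 100 original is returned at its own size.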

The new summary endpoint will extend these properties with the following additional fields, which are specific to summaries:

Name	Type	Description
type	Enum	The notional type of the intro. One of "disambiguation", "wikidata", or "standard".
intro	String	The intro of the page represented as well-formed HTML5.
plaintext_intro	String	The intro of the page represented as plaintext. This property supersedes the `extract` property of the current RESTBase Page Summary endpoint.
disambiguation_links	?Titles[]	The titles of the first N links from the disambiguation page. By default, this property should not be present.

Done

A property of type Image must have the following properties:

Name	Type	Description
source	String	The URL of the image.
width	Integer	The width of the image in px.
height	Integer	The height of the image in px.

A property of type Titles must have the following properties:

Name	Type	Description
denormalized	String	The title of the page, e.g. File:Igorrr_(band).jpg.
normalized	String	The normalized title of the page, e.g. File:Igorrr (band).jpg.
display	String	The editor-formatted title of the page (see https://www.mediawiki.org/wiki/Help:Magic_words#Displaytitle), e.g. <strong>Igorrr (band).jpg</strong>.
namespace_id	Integer	The ID of the namespace that the page is in on the wiki.
namespace_name	String	The localized name of the namespace, e.g. User, Usuario, etc.
page_id	Integer	The internal ID of the page.
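The property tables above can be transcribed into TypeScript types, which may help clients check responses at compile time. This is a sketch: the interface names follow the tables, properties marked with "?" are modelled as optional, and the values in the example object are made up for illustration rather than taken from a real response.

```typescript
interface Titles {
  denormalized: string;
  normalized: string;
  display: string;
  namespace_id: number;
  namespace_name: string;
  page_id: number;
}

interface Image {
  source: string;
  width: number;  // in px
  height: number; // in px
}

interface PagePreviewResponse {
  titles: Titles;
  lang: string;
  dir: "ltr" | "rtl";
  last_modified: string; // ISO 8601
  thumbnail?: Image;
  original?: Image;
  wikidata_description?: string;
  // Only the values listed in the table above are modelled here.
  type: "disambiguation" | "wikidata" | "standard";
  intro: string;
  plaintext_intro: string;
  disambiguation_links?: Titles[];
}

// Illustrative values only.
const example: PagePreviewResponse = {
  titles: {
    denormalized: "Igorrr_(band)",
    normalized: "Igorrr (band)",
    display: "Igorrr (band)",
    namespace_id: 0,
    namespace_name: "",
    page_id: 1,
  },
  lang: "en",
  dir: "ltr",
  last_modified: "2017-08-01T12:00:00Z",
  type: "standard",
  intro: "<p><b>Igorrr</b> is a band.</p>",
  plaintext_intro: "Igorrr is a band.",
};
```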

For a page in the wiki's content namespace(s)

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "standard".

If the page has a corresponding Wikidata item, then the wikidata_description property must be set to the item's description.

Implemented

For a page outside of the wiki's content namespaces

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "no-extract".

The extract and extract_html properties must be set to "".

Implemented

For a page that doesn't use the wikitext, wikibase-item, or wikibase-property content model

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "no-extract".

The extract and extract_html properties must be set to "".

Implemented

For a disambiguation page

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "disambiguation".

The disambiguation_links property of the response must be set to the first N links from the disambiguation page.

The intro property of the response should be set to the intro of the page so that the client may display it if appropriate.

Blocked

For a page that doesn't exist

The Page Preview API must respond with 404 Not Found.

The response body must be empty.

Implemented

For a page that doesn't have a lead section

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "standard".

The intro property of the response must be set to "".

Examples

https://en.wikipedia.org/wiki/Wikipedia:Dashboard

Implemented

For a page that has an empty intro

The response must be the same as the "For a page that doesn't have a lead section" case.

Implemented

For a page that redirects to another page

~~The Page Preview API must respond with 302 Found.~~

~~The Location HTTP header must be set to the URL that will get the intro for the target page.~~

Note: RESTBase handles redirects transparently to the underlying service (see T176517#3634838).

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "no-extract".

The extract and extract_html properties must be set to "".
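The response cases above can be summarised as a single decision function. This is a sketch under assumptions: the PageFacts and Outcome shapes and the classify name are hypothetical, and the order in which overlapping cases are checked (for instance a redirect outside the content namespaces) is a choice made for this example, not something the cases above prescribe. The Wikidata cases are left out.

```typescript
// Hypothetical summary of what is known about the requested page.
interface PageFacts {
  exists: boolean;
  inContentNamespace: boolean;
  contentModel: string;
  isRedirect: boolean;
  isDisambiguation: boolean;
}

interface Outcome {
  httpStatus: 200 | 404;
  type?: "standard" | "disambiguation" | "no-extract"; // absent for 404
}

const SUPPORTED_MODELS: string[] = ["wikitext", "wikibase-item", "wikibase-property"];

function classify(page: PageFacts): Outcome {
  if (!page.exists) {
    return { httpStatus: 404 }; // the response body must be empty
  }
  if (
    page.isRedirect ||
    !page.inContentNamespace ||
    SUPPORTED_MODELS.indexOf(page.contentModel) === -1
  ) {
    return { httpStatus: 200, type: "no-extract" };
  }
  if (page.isDisambiguation) {
    return { httpStatus: 200, type: "disambiguation" };
  }
  // Also covers pages without a lead section or with an empty intro,
  // which are "standard" with an empty intro.
  return { httpStatus: 200, type: "standard" };
}
```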

Responses for Wikidata (from T111231: Page previews for Wikidata)

For a Wikidata item

This overrides the "For a page in the wiki's content namespace" case above.

The type property must be set to "wikidata_preview".

Every member of the titles property object must be set to its equivalent form of the item's label.

The extract property must be set to the item's description.

If the item has the image property set (to I):

The image property must be set to the Image object that represents the Wikimedia Commons file referenced by I.

The thumbnail property must be set to the Image object that represents the corresponding thumbnail.

Notes

The item's description should be in the user's language. If the description isn't available in the user's language, then the API must follow the language fallback chain until one is available.
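Following the fallback chain can be sketched as a lookup over an ordered list of language codes. The pickDescription name, the shape of the descriptions map, and the idea that the caller supplies the chain are all assumptions made for this example; returning "" when nothing is found mirrors the "no description" case.

```typescript
// Sketch only: returns the first available description, or "" if none exists.
function pickDescription(
  descriptions: { [lang: string]: string },
  userLang: string,
  fallbackChain: string[],
): string {
  const candidates = [userLang].concat(fallbackChain);
  for (const lang of candidates) {
    if (Object.prototype.hasOwnProperty.call(descriptions, lang)) {
      const description = descriptions[lang];
      if (description !== "") {
        return description;
      }
    }
  }
  return "";
}
```

For example, with descriptions available in "de" and "en", a user language of "de-at", and a hypothetical chain of ["de", "en"], the German description is returned.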

For a Wikidata item with no description

The response should be the same as the "For a Wikidata item" case apart from the following:

The extract and extract_html properties of the response must be set to "".