
Evaluating and Improving MediaWiki web API client libraries/Status updates/API:Tutorial notes

From mediawiki.org

Tutorial for MediaWiki's RESTful web service API

In 2015 we changed to consistently calling this the "action API"
  • The API tutorial leads you through hands-on exercises and includes a training video.

This page contains the notes used to present Roan Kattouw's 2012 workshop on the MediaWiki web API and is not a complete tutorial on the MediaWiki web API by any means. It provides a quick introduction to the API, useful recipes for common GET requests, and a list of available resources with more information.

Speaker: Roan Kattouw, who maintained the MediaWiki API 2007-2009. Event: 2012 hackathon.

API: here this means the RESTful web API. There are other APIs -- "API" is overloaded. This is the web API.


Why should you use the web API? Bots (automated edits), AJAX (programmatically looking stuff up), JavaScript features (gadgets), and other things.

Roan says: generally any Ajax feature is going to use the api.php entry point. But right now the easiest thing to do is to write a bot or to use the API clients.

There are other things that also get casually called the MediaWiki API, like the internal interfaces that extensions and special pages can hook into. We're not talking about that right now, just the web API.

(possibly talk about how it works from the back end, if people ask)...

Everything in the database that's not private is exposed (public userpages too). Data and metadata; links between pages, images used on pages, history metadata, and more.

The API doesn't have semantic stuff like "what's the definition of this word in Wiktionary?" It does retrieve page text, the history of a page, etc. (It doesn't parse meaning out of pages because pages are freely structured.) It includes geodata.

So, you send GET or POST HTTP requests [elaborate here] and get JSON or XML format back (XML to be deprecated). (can ask for others).

(Roan says JSONP needs to be documented better.)

no such thing as plaintext, just wikitext

For this, using w:en:Special:ApiSandbox {make a redirect? Consistently name these across all wikis???}

"not optimized for educational use just yet"

Basic way things work: [blah.api.php][query string]

  • there are a bunch of wikis running MediaWiki software (the thing that runs wikipedia, etc. etc.)
  • each of these will have its own blah.api.php page
  • in this talk, we're using en.wikipedia.org/w/api.php.
  • MediaWiki developers write code, deploy it to the WMF sites, and release it as a tarball after problems have been found.
  • So different wikis _will_ be running different versions of the MediaWiki software and different versions of the API, so use api.php as the authoritative documentation.
  • But there are not usually changes that break existing usage; any such changes will be announced on the api-announce list.
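The "[api.php][query string]" shape described above can be sketched in a few lines of Python. This is only an illustration: the helper name `build_api_url` is made up for these notes, and the parameters shown are one arbitrary example.

```python
from urllib.parse import urlencode

# Entry point used in this talk; every other wiki has its own api.php URL.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_api_url(params, api_url=API_URL):
    """Return the api.php URL followed by the encoded query string."""
    return api_url + "?" + urlencode(params)


# Spaces in values are encoded as "+" by urlencode.
url = build_api_url({"action": "query", "titles": "San Francisco", "format": "json"})
print(url)
```

Pointing `api_url` at another wiki's api.php is all it takes to talk to that wiki, subject to the version differences mentioned above.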

GIANT YET TERSE AUTODOCUMENTED API PAGE. Useful bits: what parameters and what usages are accepted.

But you don't have to read all that! In this case we will be using the API sandbox. It has stuff like example queries, though it is not great.

Format: usually want json. Demo uses xml, but this will be deprecated!

Action parameter is most often 'query'. There are tons of other ones but when in doubt look in query first. "Asking" for data from database.

Most of the parameters you will never use! Generally ok to leave them blank.

You can click "examples" and it will fill out an example for you.

Fill it out, then click "make request".

If you max out the limit, the query continue element will tell you how to get more.

imlimit will take "max", though it should otherwise be a number. The max depends on the account's limit.

How many can you make? No limits on read/second, but they reserve the right to block you. Some limits generally on edits/second (not API-specific). Community will probably block "editing rampages"

Searching for all images: even requesting the maximum is not enough. The <querycontinue> element tells you how to continue; in this case, set "aifrom" to "[imagename].jpg". (Stopped at 21 min.)
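A continuation loop might look roughly like the sketch below. Two assumptions to flag: it follows the top-level "continue" object that newer MediaWiki versions return in JSON (the talk showed the older <querycontinue> element instead), and `fetch` is a hypothetical caller-supplied function that sends the parameters and returns the decoded JSON as a dict.

```python
def query_all(params, fetch):
    """Yield each response, following continuation data until none is left.

    `fetch` is a hypothetical callable: it takes a dict of request
    parameters and returns the decoded JSON response as a dict.
    """
    request = dict(params)
    while True:
        response = fetch(request)
        yield response
        if "continue" not in response:
            break  # the server has nothing more to give
        # Restart from the original parameters and add the continuation values.
        request = dict(params)
        request.update(response["continue"])
```

Because `fetch` is passed in, the same loop works with any HTTP library, and it can be exercised offline with canned responses.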

The download dump is ~12 TB and expands to ~150 TB. You shouldn't use it unless you need A LOT of data; if you just need a few thousand things, use the API, it's more efficient. Wikipedia offline: (linked in resources)


Brief word about editing. Making an edit or moving a page needs a POST request. It is a two-step process that requires a token: you don't need a personal one, but you do need one for security reasons. You have to provide a token when you make an edit. Details are "kind of involved in some cases"; read the docs or talk to Roan.
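As a rough sketch of the two steps, the parameters could be assembled like this. The names `meta=tokens`, `type=csrf` and `csrftoken` come from the current action API rather than from the talk (the 2012 API fetched tokens differently), so treat them as assumptions and check api.php on your wiki. The helper names are made up for illustration, and nothing here sends a request.

```python
def token_request():
    """Step one: parameters asking the wiki for an edit (CSRF) token."""
    return {"action": "query", "meta": "tokens", "type": "csrf", "format": "json"}


def extract_token(response):
    """Pull the token out of the decoded JSON response to step one."""
    return response["query"]["tokens"]["csrftoken"]


def edit_request(title, text, summary, token):
    """Step two: the POST body for the edit itself."""
    return {
        "action": "edit",
        "title": title,
        "text": text,
        "summary": summary,
        "format": "json",
        "token": token,  # goes in the POST body, not in the URL
    }
```

Both steps need to happen in the same logged-in session (same cookies), which is one of the "kind of involved" details the docs cover.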


So: query types are common but not obvious. (An image search, for example, yields metadata.) The page ID is used internally to refer to a page and is not usually for outside consumption. Pages can be renamed; the page ID will remain the same.

Getting URLs for files requires generators, more advanced.

So, getting more metadata: prop=info gets you metadata. You can get metadata for multiple pages by separating titles with pipes (<=50 pages), e.g. San Francisco|Kanichar|Cooperative principle.
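With more than 50 titles, one way to stay inside the limit is to split the list into pipe-joined batches. A minimal sketch, with helper names invented for these notes:

```python
def title_batches(titles, size=50):
    """Yield pipe-joined strings holding at most `size` titles each."""
    for start in range(0, len(titles), size):
        yield "|".join(titles[start:start + size])


def info_params(titles_value):
    """Parameters for a prop=info query on one batch of titles."""
    return {"action": "query", "prop": "info", "titles": titles_value, "format": "json"}


batches = list(title_batches(["San Francisco", "Kanichar", "Cooperative principle"]))
print(batches)
```

Each batch becomes one request, so 120 titles turn into three requests instead of 120.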


Getting the history of a page: it's under revisions (for historical reasons). This gets you metadata (users, comments, revision/parent IDs, etc.). You can step through the history of a page like this; fuss with the limits.
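A sketch of a history request and of reading the metadata back out. The sample response is hand-written for illustration (not real data) and follows the classic JSON layout where pages are keyed by page ID.

```python
def history_params(title, limit=5):
    """Parameters asking for revision metadata (not content) of one page."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }


def summarize_history(response):
    """Return (revid, parentid, user, comment) tuples in response order."""
    rows = []
    for page in response["query"]["pages"].values():
        for rev in page.get("revisions", []):
            rows.append((rev["revid"], rev["parentid"], rev["user"], rev.get("comment", "")))
    return rows


# Hand-written sample, for illustration only.
SAMPLE_HISTORY = {"query": {"pages": {"1": {"pageid": 1, "title": "Foo", "revisions": [
    {"revid": 30, "parentid": 20, "user": "ExampleUser",
     "timestamp": "2012-01-02T00:00:00Z", "comment": "fix typo"},
    {"revid": 20, "parentid": 10, "user": "OtherUser",
     "timestamp": "2012-01-01T00:00:00Z", "comment": ""},
]}}}}
```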

[if you're going to be making a lot of requests, consider combining them to make it easier for everyone] [cannot do multiple revisions for multiple pages; sandbox will tell you forbidden combinations]


Getting wikitext content: for historical reasons it is in the revisions API >.< rvprop=content

Can also get content of multiple pages, if you're only interested in the latest version (have to not pass "limit")

Wikitext (with everything messy, templates, links, wikitext formatting, etc)
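Reading the wikitext back out of a prop=revisions&rvprop=content response could look like this. The sample is hand-written, and the "*" key reflects the classic JSON layout; newer MediaWiki versions may nest content under "slots" instead, so check what your wiki returns.

```python
def wikitext_by_title(response):
    """Map page title to the wikitext of the first revision returned for it."""
    result = {}
    for page in response["query"]["pages"].values():
        revisions = page.get("revisions")
        if revisions:  # missing pages come back without revisions
            result[page["title"]] = revisions[0]["*"]
    return result


# Hand-written sample, for illustration only.
SAMPLE_CONTENT = {"query": {"pages": {
    "1": {"pageid": 1, "title": "Foo", "revisions": [{"*": "'''Foo''' is a [[Bar]]."}]},
    "-1": {"title": "No such page", "missing": ""},
}}}

print(wikitext_by_title(SAMPLE_CONTENT))
```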


How you get HTML: action=parse (not query) with page=page_you_want (the parameter is page, not title).

Gives you: rendered HTML text of page, JUST HTML of article. Also metadata that the parser found when it was parsing the article--categories, links, templates, etc. (most of these can get

parse: gives you html text of page; external links in wikitext; "sections" structure of page.
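A minimal sketch of building an action=parse request and picking the HTML and external links out of the result. The sample response is hand-written and uses the classic JSON layout, where the HTML sits under the "*" key.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def parse_url(page, api_url=API_URL):
    """URL asking the wiki to render one page to HTML."""
    return api_url + "?" + urlencode({"action": "parse", "page": page, "format": "json"})


def rendered_html(response):
    """The article HTML from a decoded action=parse response."""
    return response["parse"]["text"]["*"]


def external_links(response):
    """External links the parser found, or an empty list if none are present."""
    return response["parse"].get("externallinks", [])


# Hand-written sample, for illustration only.
SAMPLE_PARSE = {"parse": {
    "title": "Foo",
    "text": {"*": "<p><b>Foo</b> is a thing.</p>"},
    "externallinks": ["http://example.org/"],
    "sections": [],
}}
```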

(it gives all the metadata because google asked...)


Question: is everything on the mobile the same? Content is not different, wikitext base is same, HTML rendered to mobile might be different.

"sections" 3 ways to refer to section, they're all in there :P


Section 0 is the stuff in the beginning of the page.

So go back to action=parse with page=San_Francisco. The "section" parameter takes the "index" of the section.

ToC is not part of wikitext, autogenerated. It'll be in HTML but not in wikitext.

You can add "section=N" to get only one section (including section 0, the first section).
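To go from a heading to its section, one approach is to look up the "index" in the "sections" list and pass it back as section=N. A sketch with a hand-written sections list; the field names follow what action=parse returns, but the entries themselves are made up.

```python
def section_index(sections, heading):
    """Return the "index" of the section whose heading text matches, else None."""
    for section in sections:
        if section["line"] == heading:
            return section["index"]
    return None


def section_params(page, index):
    """Parameters asking action=parse for a single section of a page."""
    return {"action": "parse", "page": page, "section": index, "format": "json"}


# Hand-written sample, for illustration only. Note the API returns index as a string.
SAMPLE_SECTIONS = [
    {"toclevel": 1, "level": "2", "line": "History", "number": "1", "index": "1", "anchor": "History"},
    {"toclevel": 1, "level": "2", "line": "Geography", "number": "2", "index": "2", "anchor": "Geography"},
]

geography = section_params("San_Francisco", section_index(SAMPLE_SECTIONS, "Geography"))
lead = section_params("San_Francisco", 0)  # section 0: everything before the first heading
```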


Nope on jQuery and AJAX.

Resources: autogenerated documentation; documentation on mediawiki.org (may be outdated); API:Sandbox; mailing list; announcements list. If they make a change that might break existing tools, it will be announced there. If you use the API, subscribe.

#mediawiki is the general support channel.

Questions:

Is action=query less resource-intensive than action=parse? It will pull from the parser cache and will be fast. Generally, the guide to not exhausting resources is: don't make parallel requests.


Definitions

  • REST API for MediaWiki
    • exposes things MediaWiki has in the database or otherwise understands
    • does not include semantic stuff like "definition of a word in Wiktionary" or even "lead paragraph of an article"
    • usage: send HTTP requests (GET or POST) to the api.php URL, receive XML or JSON or other formats. You'll usually want JSON or XML.

How to use it

  • Entry point: http://en.wikipedia.org/w/api.php (see API:Main page#The endpoint)
    • or any other wiki
      • Talk about versioning and how non-WMF wikis might have different version of MediaWiki and thus the API
    • https works too!
  • Parameters are passed in query string. Not passing any will give you the help page with the autogenerated documentation.

Follow along by using w:en:Special:ApiSandbox -- query is what you will usually want.

  • Example query: ?action=query&titles=San_Francisco&prop=images&imlimit=20&format=jsonfm
    • action=query is used for most read actions, separate action= modules exist for write actions
    • titles= takes one or more titles for the query to operate on
    • prop=images lists the images on a page; lots of other stuff in prop=, list=, meta=
    • imlimit= sets the max # of results (each module has its own prefixed limit parameter). Default is 10, 'max' works
    • popular values for format= : xml, json, xmlfm (default), jsonfm
    • If you want to find sections from the table of contents, use section= using the index property, and you can call 0 for the wikitext that comes before the first section header.
  • State-changing actions (e.g. editing)
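The example query above, written out as parameters, plus a sketch of reading the image titles from the answer. It uses format=json rather than jsonfm, since jsonfm is the pretty-printed HTML view meant for people. The sample response is hand-written for illustration.

```python
# The example query from the list above, with format=json for machine use.
EXAMPLE_PARAMS = {
    "action": "query",
    "titles": "San_Francisco",
    "prop": "images",
    "imlimit": 20,
    "format": "json",
}


def image_titles(response):
    """Collect the titles of all images listed for the pages in a response."""
    titles = []
    for page in response["query"]["pages"].values():
        for image in page.get("images", []):
            titles.append(image["title"])
    return titles


# Hand-written sample, for illustration only.
SAMPLE_IMAGES = {"query": {"pages": {"1": {"pageid": 1, "ns": 0, "title": "Foo", "images": [
    {"ns": 6, "title": "File:Example one.jpg"},
    {"ns": 6, "title": "File:Example two.png"},
]}}}}

print(image_titles(SAMPLE_IMAGES))
```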

If you want to make a lot of API calls, and perhaps run very busy and active bots et al., please talk to the admins of that wiki ahead of time so they do not block you. Also run your requests in serial, not parallel. resource for contacting them to go here. TODO
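Running requests in serial can be as simple as a loop with a pause between calls. In this sketch `fetch` is a hypothetical caller-supplied function that performs one request, and the one-second pause is an arbitrary assumption, not a documented limit.

```python
import time


def run_serially(requests, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Send requests one at a time, pausing between them.

    `fetch` is a hypothetical callable that performs a single request and
    returns its response. `sleep` is injectable so the loop can be tried
    out without actually waiting.
    """
    responses = []
    for index, params in enumerate(requests):
        if index > 0:
            sleep(pause_seconds)  # wait before every request except the first
        responses.append(fetch(params))
    return responses
```

The next request is only sent after the previous one has returned, which is the whole point: no parallel load on the wiki.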

  • There are limits in the software on how many edits per second you can make.
  • Example nouns to look up:
    • Kanichar
    • Kolar_Gold_Fields
    • Cooperative_principle
    • MS_Riverdance


Magic recipes


Nonobvious and very useful.

  • Things you'll definitely need:
    • prop=info for basic page info
    • prop=revisions for page history
    • prop=revisions&rvprop=content for page wikitext
    • action=parse for page HTML
  • Doing crazy stuff
    • multiple titles with titles=Foo|Bar|Baz (This will make multiple calls count as one for the purpose of rate limiting)
      • This works for pages but not revisions. Read the documentation via the Sandbox or via api.php autodocs.
    • multiple modules with &prop=images|templates&list=allpages|blocks
    • generators (kind of like UNIX pipes) with &titles=Foo&generator=links&prop=revisions

Resources

  • Getting help
    • Autogenerated documentation: api.php with no parameters such as https://en.wikipedia.org/w/api.php
    • Documentation on mediawiki.org: API:Action API (details about specific modules/parameters often outdated, autogenerated docs are authoritative)
    • The API Sandbox -- example w:en:Special:ApiSandbox
    • mail:mediawiki-api -- mediawiki-api@lists.wikimedia.org
    • mediawiki-api-announce@lists.wikimedia.org - PLEASE subscribe because we tell you about breaking changes, which happen a few times every year. mail:mediawiki-api-announce
    • #mediawiki on IRC
    • me! (Roan Kattouw)

You may actually want
