Web scraping access

From mediawiki.org

Web scraping access, also commonly referred to as screen scraping, involves requesting a MediaWiki page using index.php, looking at the raw HTML code (what you would see if you clicked View → Source in most browsers), and then analyzing the HTML for patterns.
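The request-and-parse flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `example.org` base URL, the helper names, and the sample HTML are hypothetical, and the `firstHeading` id reflects what common MediaWiki skins emit today, which is exactly the kind of detail that can change without notice.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def page_url(index_php, title):
    """Build the index.php URL for a rendered page (hypothetical helper)."""
    return index_php + "?" + urlencode({"title": title})


class HeadingScraper(HTMLParser):
    """Collect the text inside <h1 id="firstHeading">.

    Assumption: the skin marks the page title with this id.
    """

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and dict(attrs).get("id") == "firstHeading":
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data


def scrape_heading(html):
    """Return the page title found in the HTML, or '' if the pattern is absent."""
    parser = HeadingScraper()
    parser.feed(html)
    parser.close()
    return parser.heading.strip()


# Illustrative stand-in for what View -> Source might show; not real server output.
SAMPLE_HTML = (
    '<html><body><h1 id="firstHeading" class="firstHeading">'
    "<span>Web scraping access</span></h1>"
    '<div class="mw-parser-output"><p>Some text</p></div></body></html>'
)

print(page_url("https://example.org/w/index.php", "Main Page"))
print(scrape_heading(SAMPLE_HTML))
```

If a skin renames or drops the heading id, `scrape_heading` silently returns an empty string, which illustrates how fragile pattern matching on HTML is.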

This approach has two main problems: the MediaWiki interface can change without notice, which may break the bot code, and requesting rendered HTML places a larger load on the server than retrieving the wikitext itself.

Before the API was added to MediaWiki, tools used web scraping to access content on MediaWiki sites.

Some tools used a mixed mode: web scraping for UI and navigational information, and raw, unprocessed wikitext for page processing, obtained by requesting index.php?action=raw. The API's revisions module (prop=revisions) is roughly equivalent to action=raw and allows retrieving additional information.
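As a rough comparison of the two routes, the sketch below builds an action=raw URL and a corresponding API query using the revisions module. The `example.org` endpoints and the function names are hypothetical; the chosen `rvprop` fields and the sample response are illustrative assumptions about the `formatversion=2` JSON layout, not captured server output. No network request is made.

```python
from urllib.parse import urlencode


def raw_wikitext_url(index_php, title):
    """URL that returns only the unprocessed wikitext of a page."""
    return index_php + "?" + urlencode({"title": title, "action": "raw"})


def api_revisions_url(api_php, title):
    """URL for the API revisions module, asking for content plus metadata."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content|timestamp|user",  # extra fields action=raw cannot give
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
    }
    return api_php + "?" + urlencode(params)


def extract_revision(api_response):
    """Pull wikitext, timestamp and user out of a decoded API response.

    Assumes the formatversion=2 layout, where 'pages' is a list.
    """
    revision = api_response["query"]["pages"][0]["revisions"][0]
    return {
        "wikitext": revision["slots"]["main"]["content"],
        "timestamp": revision["timestamp"],
        "user": revision["user"],
    }


# Hand-written sample in the assumed response shape, for illustration only.
SAMPLE_RESPONSE = {
    "query": {
        "pages": [
            {
                "title": "Main Page",
                "revisions": [
                    {
                        "user": "ExampleUser",
                        "timestamp": "2016-01-01T00:00:00Z",
                        "slots": {"main": {"content": "'''Hello''' world"}},
                    }
                ],
            }
        ]
    }
}

print(raw_wikitext_url("https://example.org/w/index.php", "Main Page"))
print(api_revisions_url("https://example.org/w/api.php", "Main Page"))
print(extract_revision(SAMPLE_RESPONSE))
```

The API route returns structured data, so a tool reads the author and timestamp from named fields instead of matching patterns in HTML.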

Most MediaWiki libraries now use the API.

Even after the API was well established, web scraping was still frequently used, e.g. for early versions of various mobile technology, and even in 2016 Pywikibot "compat" was still in use.

There is basically no reason to use this technique anymore.
