Wikimedia Product/Wikimedia Product Infrastructure team/Action API request analytics

Action API request analytics will provide reports and/or dashboards to track usage of the MediaWiki Action API for Wikimedia production websites. This tracking is intended to be similar to the Pageviews tracking currently done by the Analytics team for articles in the main namespace.

Desired outcome

Data sets providing:

  • Number of user agents coming from Labs or third party services, on a monthly basis
  • Volume of API requests coming from Labs or third party services, on a monthly basis
  • Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis
  • Ranking of most requested actions/parameters, on a monthly basis

Data acquisition

Raw Action API requests will be tracked using MediaWiki structured logging, Kafka and Hive.

  1. Done: Log events will be emitted by MediaWiki for each Action API request using a structured logging context that contains the data needed to populate the Hive tables (Gerrit change 240614).
  2. Done: Monolog will be configured to route these log events to a Kafka topic.
  3. Done: Camus will process events from the Kafka topic and load them into a raw data table in Hive.
  4. Oozie will run a (daily?) Hive script to summarize the raw data table into various aggregate tables designed for specific reporting needs via ETL processing (a sketch of such a rollup follows this list).
  5. Oozie will run a Hive script to discard the raw request data after processing, to reduce the risk of leaking sensitive data through a network break-in or malicious actor.
  6. Oozie will run a Hive script to generate monthly summary data from the aggregate tables for export to interested parties.
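
As an illustration of step 4, the following is a minimal sketch of an hourly rollup into action_ua_hourly. The raw table name (wmf_raw.apiaction), its partition layout and the classify_ip() helper are assumptions used for illustration, not the deployed job; the field names come from the Avro schema in the next section.

-- Sketch only: assumes the Camus-loaded raw table is wmf_raw.apiaction,
-- partitioned by year/month/day/hour, and that classify_ip() stands in for
-- whatever maps a client IP to wikimedia / wikimedia_labs / internet.
INSERT OVERWRITE TABLE action_ua_hourly
PARTITION (year = ${hivevar:year}, month = ${hivevar:month},
           day = ${hivevar:day}, hour = ${hivevar:hour})
SELECT
  useragent       AS user_agent,
  wiki,
  classify_ip(ip) AS ip_class,
  COUNT(*)        AS view_count
FROM wmf_raw.apiaction
WHERE year = ${hivevar:year}
  AND month = ${hivevar:month}
  AND day = ${hivevar:day}
  AND hour = ${hivevar:hour}
GROUP BY useragent, wiki, classify_ip(ip);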

Avro schema

  {
      "type": "record",
      "name": "ApiRequest",
      "namespace": "org.wikimedia.mediawiki.api",
      "doc": "A single request to the MediaWiki Action API (api.php)",
      "fields": [
          { "name": "ts",               "type": "int" },
          { "name": "ip",               "type": "string" },
          { "name": "userAgent",        "type": "string" },
          { "name": "wiki",             "type": "string" },
          { "name": "timeSpentBackend", "type": "int" },
          { "name": "hadError",         "type": "boolean" },
          { "name": "errorCodes",       "type": {
              "type": "array", "items": "string"
          } },
          { "name": "params",          "type": {
              "type": "map", "values": "string"
          } }
      ]
  }
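
For reference, the wmf_raw.apiaction raw table assumed in the earlier rollup sketch might be declared roughly as follows; the table name, the STORED AS AVRO clause and the partition columns are assumptions about how the Camus import is set up, not the production DDL.

-- Sketch only: a raw-request table matching the Avro schema above.
CREATE EXTERNAL TABLE IF NOT EXISTS wmf_raw.apiaction (
  ts                INT                COMMENT 'Request timestamp',
  ip                STRING             COMMENT 'Client IP address',
  useragent         STRING             COMMENT 'Raw user-agent',
  wiki              STRING             COMMENT 'Target wiki (e.g. enwiki)',
  timespentbackend  INT                COMMENT 'Backend time spent serving the request',
  haderror          BOOLEAN            COMMENT 'Whether the request reported an error',
  errorcodes        ARRAY<STRING>      COMMENT 'API error codes, if any',
  params            MAP<STRING,STRING> COMMENT 'Request parameters'
)
PARTITIONED BY (
  year INT, month INT, day INT, hour INT
)
STORED AS AVRO;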

Hive schema

-- Create tables for Action API stats
--
-- Usage:
--     hive -f create-action-tables.sql --database wmf

CREATE TABLE IF NOT EXISTS action_ua_hourly (
  user_agent STRING COMMENT 'Raw user-agent',
  wiki       STRING COMMENT 'Target wiki (e.g. enwiki)',
  ip_class   STRING COMMENT 'IP-based origin; one of wikimedia, wikimedia_labs or internet',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
  action     STRING COMMENT 'Action parameter value',
  wiki       STRING COMMENT 'Target wiki (e.g. enwiki)',
  ip_class   STRING COMMENT 'IP-based origin; one of wikimedia, wikimedia_labs or internet',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
  action     STRING COMMENT 'Action parameter value',
  param      STRING COMMENT 'Parameter name, e.g. prop, list, meta, generator',
  value      STRING COMMENT 'Parameter value',
  wiki       STRING COMMENT 'Target wiki (e.g. enwiki)',
  ip_class   STRING COMMENT 'IP-based origin; one of wikimedia, wikimedia_labs or internet',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

-- NOTE: there are many params we would not want to count distinct values of
-- at all (eg maxlag, smaxage, maxage, requestid, origin, centralauthtoken,
-- titles, pageids). It will be easier to whitelist in the ETL process
-- than to try to selectively blacklist.

(action, param, value) tuples

We do not want to try to count all of the distinct (action, param, value) tuples seen in the aggregation tables. For some params we will also want to expand an embedded list of values given as a single parameter into a list of (action, param, value) tuples that are counted individually (a Hive sketch of this explosion follows the list below).

For the initial ETL process we will count these tuples:

  • action=query
    • param=prop, value from exploding the '|' delimited value
    • param=list, value from exploding the '|' delimited value
    • param=meta, value from exploding the '|' delimited value
    • param=generator
  • action=flow
    • param=submodule
  • ???
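
A minimal Hive sketch of this explosion, for the action=query prop values only, is below. The raw table name and the classify_ip() helper are the same assumptions used in the earlier rollup sketch, and the remaining whitelisted parameters (list, meta, generator, the flow submodule) would follow the same pattern as additional UNION ALL branches.

-- Sketch only: explode '|' delimited prop values into individual
-- (action, param, value) rows and aggregate them into action_param_hourly.
INSERT OVERWRITE TABLE action_param_hourly
PARTITION (year = ${hivevar:year}, month = ${hivevar:month},
           day = ${hivevar:day}, hour = ${hivevar:hour})
SELECT
  'query'         AS action,
  'prop'          AS param,
  submodule       AS value,
  wiki,
  classify_ip(ip) AS ip_class,
  COUNT(*)        AS view_count
FROM wmf_raw.apiaction
LATERAL VIEW explode(split(params['prop'], '\\|')) exploded AS submodule
WHERE year = ${hivevar:year}
  AND month = ${hivevar:month}
  AND day = ${hivevar:day}
  AND hour = ${hivevar:hour}
  AND params['action'] = 'query'
  AND params['prop'] IS NOT NULL
GROUP BY submodule, wiki, classify_ip(ip);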

Monthly reports

Number of user agents coming from Labs or third party services, on a monthly basis

SELECT COUNT(DISTINCT user_agent) as visitors
FROM action_ua_hourly
WHERE year = :year
  AND month = :month
  AND ip_class IN ('wikimedia_labs', 'internet');


Volume of API requests coming from Labs or third party services, on a monthly basis

SELECT ip_class, SUM(view_count) as hits
FROM action_ua_hourly
WHERE year = :year
  AND month = :month
  AND ip_class IN ('wikimedia_labs', 'internet')
GROUP BY ip_class;


Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis

SELECT user_agent, SUM(view_count) as hits
FROM action_ua_hourly
WHERE year = :year
  AND month = :month
  AND ip_class IN ('wikimedia_labs', 'internet')
GROUP BY user_agent
ORDER BY hits desc;


Ranking of most requested actions/parameters, on a monthly basis

SELECT action, SUM(view_count) as hits
FROM action_action_hourly
WHERE year = :year
  AND month = :month
GROUP BY action
ORDER BY hits desc;

SELECT action, param, value, SUM(view_count) as hits
FROM action_param_hourly
WHERE year = :year
  AND month = :month
GROUP BY action, param, value
ORDER BY hits desc;
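
The :year and :month placeholders above are schematic. One way to run such a report (an illustration, not necessarily the production tooling) is with Hive variable substitution; the file name monthly_ua_ranking.hql below is just an example.

-- Illustrative invocation:
--     hive --database wmf --hivevar year=2015 --hivevar month=12 -f monthly_ua_ranking.hql
SELECT user_agent, SUM(view_count) as hits
FROM action_ua_hourly
WHERE year = ${hivevar:year}
  AND month = ${hivevar:month}
  AND ip_class IN ('wikimedia_labs', 'internet')
GROUP BY user_agent
ORDER BY hits desc;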

Magnitude estimates from existing data

Some data on magnitude of the data set taken from the existing webrequests data for 2015-11-01:

  • Requests per day: 464,794,956
  • Distinct user agents: 337,360
  • Distinct user agents with >1,000,000 requests: 65
  • Distinct user agents with >100,000 requests: 446
  • Distinct user agents with >10,000 requests: 2,118
  • Distinct user agents with >1,000 requests: 9,495
  • 50% of requests made by top 48 user agents
  • 75% of requests made by top 256 user agents
  • 95% of requests made by top 4,228 user agents
  • Top user agent: "-" (unspecified) 38,342,930 requests
  • Top user agent that is not a common web browser: "Peachy MediaWiki Bot API Version 2.0 (alpha 8)" 8,674,297 requests
  • 5 of the top 10 user agents are web browsers (assumed to be Ajax requests for API data)
  • Traffic percentages: 90% external, 9% labs, 1% internal (NOTE: this is traffic measured at the Varnish level, which probably does not include most Parsoid/RESTBase requests)
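
A query along the following lines against wmf.webrequest could reproduce numbers like these; the uri_path filter and column names reflect the usual webrequest layout but are assumptions here rather than the exact query that was run.

-- Sketch only: estimate Action API volume and distinct user agents for one day
-- from the existing webrequest data.
SELECT COUNT(*)                   AS api_requests,
       COUNT(DISTINCT user_agent) AS distinct_user_agents
FROM wmf.webrequest
WHERE year = 2015 AND month = 11 AND day = 1
  AND uri_path = '/w/api.php';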

API requests per day, FY2015/16 Q2

  Average daily Action API requests:    447,339,466 [1][2]
  Maximum daily Action API requests:    499,240,751 [1][2]
  Average daily distinct User-Agents:       386,449 [1][2]
  Maximum daily distinct User-Agents:       684,771 [1][2]


Top 10 user agents, 2015-12-01 through 2015-12-31 (13,241,976,328 requests)

  Rank  User-Agent                                Percent of total
  1     no user agent specified                   7.67%
  2     Digplanet/1.0                             2.03%
  3     Peachy MediaWiki Bot                      1.95%
  4     Chrome 47.0.2526.106                      1.81%
  5     https://github.com/goldsmith/Wikipedia/   1.62%
  6     IE 11.0                                   1.62%
  7     ArtistPedia/1.1                           1.43%
  8     Firefox 42                                1.34%
  9     Chrome 47.0.2526.80                       1.06%
  10    Chrome 47.0.2526.106                      0.93%

References

  1. (2015-11-01 to 2016-01-13)
  2. Measurements taken from wmf.webrequest data, measured at the Varnish cache layer.
