Wikimedia Product/Wikimedia Product Infrastructure team/Action API request analytics
Action API request analytics will be reports and/or dashboards that track usage of the MediaWiki Action API on Wikimedia production websites. This tracking is intended to be similar to the pageview tracking currently done by the Analytics team for articles in the main namespace.
Desired outcome
Data sets providing:
- Number of user agents coming from Labs or third party services, on a monthly basis
- Volume of API requests coming from Labs or third party services, on a monthly basis
- Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis
- Ranking of most requested actions/parameters, on a monthly basis
Data acquisition
Raw Action API requests will be tracked using MediaWiki structured logging, Kafka and Hive.
- Done: Log events will be emitted by MediaWiki for each Action API request using a structured logging context that contains the data needed to populate the Hive tables. Gerrit change 240614
- Done: Monolog will be configured to route these log events to a Kafka topic.
- Done: Camus will process events from the Kafka topic and load them into a raw data table in Hive.
- Oozie will run a (daily?) Hive script to summarize the raw data table into aggregate tables designed for specific reporting needs (a sketch of this ETL step follows the list).
- Oozie will run a Hive script to discard the raw request data after processing, to reduce the risk of leaking sensitive data through a network breach or a malicious actor.
- Oozie will run a Hive script to generate monthly summary data from the aggregate tables for export to interested parties.
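A minimal sketch of the aggregation and purge steps, assuming the Camus-loaded raw data is queryable in Hive as wmf_raw.ApiAction with the Avro fields described below and year/month/day/hour partitions; the raw table name, the partition layout, and the IP ranges used for classification are illustrative assumptions, not the production values.
-- Hourly roll-up of raw Action API requests into action_ua_hourly.
-- year/month/day/hour are supplied by the Oozie coordinator.
INSERT OVERWRITE TABLE wmf.action_ua_hourly
  PARTITION (year = ${year}, month = ${month}, day = ${day}, hour = ${hour})
SELECT userAgent, wiki, ipClass, COUNT(*) AS viewCount
FROM (
  SELECT
    userAgent,
    wiki,
    CASE
      WHEN ip LIKE '10.%'     THEN 'wikimedia'       -- placeholder ranges only; the
      WHEN ip LIKE '172.16.%' THEN 'wikimedia_labs'  -- real ETL would use the
      ELSE 'internet'                                -- production network lists
    END AS ipClass
  FROM wmf_raw.ApiAction
  WHERE year = ${year} AND month = ${month}
    AND day = ${day} AND hour = ${hour}
) req
GROUP BY userAgent, wiki, ipClass;

-- Discard the raw partition once it has been summarized. If the raw table is
-- external, the underlying HDFS files must also be deleted.
ALTER TABLE wmf_raw.ApiAction DROP IF EXISTS
  PARTITION (year = ${year}, month = ${month}, day = ${day}, hour = ${hour});
The same pattern, grouped by action or by exploded (action, param, value) tuples, would populate action_action_hourly and action_param_hourly.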
Avro schema
[edit]{ "type": "record", "name": "ApiRequest", "namespace": "org.wikimedia.mediawiki.api", "doc": "A single request to the MediaWiki Action API (api.php)", "fields": [ { "name": "ts", "type": "int" }, { "name": "ip", "type": "string" }, { "name": "userAgent", "type": "string" }, { "name": "wiki", "type": "string" }, { "name": "timeSpentBackend", "type": "int" }, { "name": "hadError", "type": "boolean" }, { "name": "errorCodes", "type": { "type": "array", "items": "string" } { "name": "params", "type": { "type": "map", "values": "string" } } ] }
- Gerrit change 265164 Avro schema
- Gerrit change 240614, Gerrit change 265507, Gerrit change 271673 Implementation of matching 'ApiRequest' log channel for MediaWiki core
- Gerrit change 273559 Configuration patch to send 'ApiRequest' channel to Kafka
Hive schema
-- Create tables for Action API stats
--
-- Usage:
-- hive -f create-action-tables.sql --database wmf
CREATE TABLE IF NOT EXISTS action_ua_hourly (
userAgent STRING COMMENT 'Raw user-agent',
wiki STRING COMMENT 'Target wiki (e.g. enwiki)',
ipClass STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
year INT COMMENT 'Unpadded year of request',
month INT COMMENT 'Unpadded month of request',
day INT COMMENT 'Unpadded day of request',
hour INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;
CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
action STRING COMMENT 'Action parameter value',
wiki STRING COMMENT 'Target wiki (e.g. enwiki)',
ipClass STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
year INT COMMENT 'Unpadded year of request',
month INT COMMENT 'Unpadded month of request',
day INT COMMENT 'Unpadded day of request',
hour INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;
CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
action STRING COMMENT 'Action parameter value',
param STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
value STRING COMMENT 'Parameter value',
wiki STRING COMMENT 'Target wiki (e.g. enwiki)',
ipClass STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
year INT COMMENT 'Unpadded year of request',
month INT COMMENT 'Unpadded month of request',
day INT COMMENT 'Unpadded day of request',
hour INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;
-- NOTE: there are many params we would not want to count distinct values of
-- at all (e.g. maxlag, smaxage, maxage, requestid, origin, centralauthtoken,
-- titles, pageids). It will be easier to whitelist in the ETL process
-- than to try to selectively blacklist.
(action, param, value) tuples
We do not want to try to count all of the distinct (action, param, value) tuples seen in the aggregation tables. For some params we will also want to expand a '|' delimited list of values given as a single parameter into multiple (action, param, value) tuples that are counted individually.
For the initial ETL process we will count these tuples (see the Hive sketch after this list):
- action=query
- param=prop, value from exploding the '|' delimited value
- param=list, value from exploding the '|' delimited value
- param=meta, value from exploding the '|' delimited value
- param=generator
- action=flow
- param=submodule
- ???
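A sketch of the explode step for these tuples, assuming the same raw wmf_raw.ApiAction table with a params map<string,string> column and hourly partitions; table, partition and variable names are assumptions, and the ipClass column is omitted for brevity.
-- Explode '|'-delimited action=query parameters into individual
-- (action, param, value) rows for action_param_hourly.
SELECT
  'query' AS action,
  p.param,
  v.value,
  wiki,
  COUNT(*) AS viewCount
FROM wmf_raw.ApiAction
LATERAL VIEW explode(map('prop', params['prop'],
                         'list', params['list'],
                         'meta', params['meta'])) p AS param, raw_value
LATERAL VIEW explode(split(p.raw_value, '\\|')) v AS value
WHERE params['action'] = 'query'
  AND year = ${year} AND month = ${month}
  AND day = ${day} AND hour = ${hour}
GROUP BY p.param, v.value, wiki;
Single-valued parameters (query's generator, flow's submodule) would be counted the same way without the second explode. Parameters that are absent drop out naturally because explode() over a NULL array produces no rows.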
Monthly reports
Number of user agents coming from Labs or third party services, on a monthly basis
SELECT COUNT(DISTINCT userAgent) AS userAgents
FROM action_ua_hourly
WHERE year = :year
AND month = :month
AND ipClass IN ('wikimedia_labs', 'internet');
Volume of API requests coming from Labs or third party services, on a monthly basis
SELECT ipClass, SUM(viewCount) AS hits
FROM action_ua_hourly
WHERE year = :year
AND month = :month
GROUP BY ipClass;
Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis
SELECT userAgent, SUM(viewCount) AS hits
FROM action_ua_hourly
WHERE year = :year
AND month = :month
AND ipClass IN ('wikimedia_labs', 'internet')
GROUP BY userAgent
ORDER BY hits DESC;
Ranking of most requested actions/parameters, on a monthly basis
SELECT action, SUM(viewCount) AS hits
FROM action_action_hourly
WHERE year = :year
AND month = :month
GROUP BY action
ORDER BY hits DESC;
SELECT action, param, value, SUM(viewCount) AS hits
FROM action_param_hourly
WHERE year = :year
AND month = :month
GROUP BY action, param, value
ORDER BY hits DESC;
Magnitude estimates from existing data
Some estimates of the magnitude of the data set, taken from the existing webrequest data for 2015-11-01:
- Requests per day: 464,794,956
- Distinct user agents: 337,360
- Distinct user agents with >1,000,000 requests: 65
- Distinct user agents with >100,000 requests: 446
- Distinct user agents with >10,000 requests: 2,118
- Distinct user agents with >1000 requests: 9,495
- 50% of requests made by top 48 user agents
- 75% of requests made by top 256 user agents
- 95% of requests made by top 4,228 user agents
- Top user agent: "-" (unspecified) 38,342,930 requests
- Top user agent that is not a common web browser: "Peachy MediaWiki Bot API Version 2.0 (alpha 8)" 8,674,297 requests
- 5 of top 10 user agents are web browsers (ajax requests for API data assumed)
- Traffic percentages: 90% external, 9% labs, 1% internal (NOTE: this is traffic measured at the Varnish level, which probably does not include most Parsoid/RESTBase requests)
- Average daily Action API requests: 447,339,466 [1][2]
- Maximum daily Action API requests: 499,240,751 [1][2]
- Average daily distinct User-Agents: 386,449 [1][2]
- Maximum daily distinct User-Agents: 684,771 [1][2]
Top 10 User-Agents by share of total Action API requests:
Rank | User-Agent | Percent of total |
---|---|---|
1 | no user agent specified | 7.67% |
2 | Digplanet/1.0 | 2.03% |
3 | Peachy MediaWiki Bot | 1.95% |
4 | Chrome 47.0.2526.106 | 1.81% |
5 | https://github.com/goldsmith/Wikipedia/ | 1.62% |
6 | IE 11.0 | 1.62% |
7 | ArtistPedia/1.1 | 1.43% |
8 | Firefox 42 | 1.34% |
9 | Chrome 47.0.2526.80 | 1.06% |
10 | Chrome 47.0.2526.106 | 0.93% |
References
See also
- Analytics/Data/ApiAction schema
- Analytics/Data/Webrequest schema
- Analytics/Data/Pageview_hourly schema