Analytics/Wikistats/TrafficReports
Migrate / Replace
For a discussion on the future of these reports, see Analytics/Wikistats/TrafficReports/Future
Monthly wikistats page view trend reports, indirectly based on squid logs
- Platform: wikistats server stat1002
- Update frequency: daily
- Input: hourly per-wiki page view totals (files named projectcounts-yyyymmdd-hhnnss), produced by webstatscollector and downloadable from the dump server
- Output:
- csv file for the monthly Report Card
- daily updated monthly page view stats. These reports are available in 4 variations: for desktop and mobile traffic, normalized (all months recalculated to 30 days) or original. There are reports for every project and a report for all projects combined. See sitemap.
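As an illustration, here is a minimal Perl sketch of rolling such hourly files up into monthly per-project totals. It assumes the four-field line layout of webstatscollector output (project code, a '-' placeholder, request count, byte count) and a hypothetical example month; verify against actual files before relying on it.

  # hypothetical roll-up of hourly projectcounts files into monthly per-project view totals
  use strict;
  use warnings;

  my %monthly;
  for my $file (glob "projectcounts-201404*") {      # one file per hour, example month 2014-04
      open my $in, '<', $file or die "$file: $!";
      while (<$in>) {
          my ($project, undef, $views) = split ' ';  # assumed layout, e.g. "en - 12345678 987654321"
          $monthly{$project} += $views;
      }
      close $in;
  }
  print "$_,$monthly{$_}\n" for sort keys %monthly;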
Monthly wikistats traffic reports (geo & traffic type), based on (un)sampled squid logs
- Update frequency: monthly (manual run)
- Scripts location: github; dev/beta on squid1002:/home/ezachte/wikistats/squids
- Input: stat1002:/a/squid/archive/sampled
SquidCountArchive.pl
Collects a large variety of data from /a/squid/archive/sampled in two passes:
- updates a few monthly files (in folder /a/wikistats_git/squids/csv/yyyy-mm), and
- creates a large set of daily csv files in folder ../yyyy-mm/yyyy-mm-dd, with subfolders ../private and ../public.
Modules
- EzLib.pm (now at EZ ~/lib)
- ../SquidCountArchiveProcessLogRecord.pm
- ../SquidCountArchiveReadInput.pm
- ../SquidCountArchiveWriteOutput.pm
Phases
- Pass/phase 1: collect frequencies for all IP addresses; these are needed in phase 2 to filter addresses which are most likely from bots (a frequency > 1 in the sampled squid log stands for over 1000 real views/edits; especially for edits this most likely means a bot; a few false positives are accepted)
- Pass/phase 2: collect all other counts (see the sketch below)
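A minimal sketch of the pass 1 bookkeeping (hypothetical code, not the actual wikistats implementation; the field position follows the log format listed under Input below):

  # pass 1 (hypothetical sketch): tally how often each client IP occurs in one day's sampled log
  use strict;
  use warnings;

  my $log_file = shift;                       # path to one day's gzipped sampled-1000 log
  my %ip_freq;
  open my $in, '-|', "gzip -dc $log_file" or die $!;
  while (<$in>) {
      my $ip = (split ' ')[4];                # field 5 of the log record is the client IP
      $ip_freq{$ip}++;
  }
  close $in;
  # pass 2 rereads the log and collects all other counts,
  # treating edit requests whose IP has $ip_freq{$ip} > 1 as likely bot activity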
Arguments
- -t for test mode (run counts for short timespan)
- -d date range, can be absolute or relative, specify one value for a single day, two values for a range of days
- -d yyyy/mm/dd[-yyyy/mm/dd] (slashes are optional). This is best for generating counts for a specific day or period.
- -d -m[-n], where m and n are numbers of days before today. This is best for daily cron scheduling (e.g. -d 1-7 => run for the last 7 days before today, skipping days for which output already exists).
- -f [1|2|12|21], forces a rerun of phase 1 and/or 2, even when that phase completed successfully earlier
- phase 1 counts as completed when SquidDataIpFrequencies.csv.bz2 exists in ../yyyy-mm/yyyy-mm-dd
- phase 2 counts as completed when #Ready exists in ../yyyy-mm/yyyy-mm-dd
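For example (hypothetical command lines based on the flags above; the actual cron entry may differ):

  perl SquidCountArchive.pl -d 1-7                          # daily cron style: last 7 days, skip finished days
  perl SquidCountArchive.pl -d 2014/04/01-2014/04/30 -f 12  # force full rerun of both phases for April 2014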
Input
- 1:1000 sampled squid log files from stat1002:/srv/squid/archive/sampled
- Format
- see also wikitech
- 1. Hostname
- 2. Sequence number
- 3. Current time in ISO 8601 format (plus milliseconds), according to the squid server's clock
- 4. Request time in ms
- 5. Client IP
- 6. Squid request status, HTTP status code
- 7. Reply size including HTTP headers
- 8. Request method (GET/POST etc)
- 9. URL
- 10. Squid hierarchy status, peer IP
- 11. MIME content type
- 12. Referer header
- 13. X-Forwarded-For header
- 14. User-Agent header
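A hypothetical sketch of splitting one such record into named fields. It assumes space-separated fields with the User-Agent, which may itself contain spaces, as the last field; the real parser may differ.

  # hypothetical sketch: split one sampled squid log record into the 14 fields listed above
  my $line   = <STDIN>;                       # one log record
  my @fields = split ' ', $line, 14;          # limit 14 so spaces inside the User-Agent survive
  my @names  = qw(hostname seqno timestamp request_time client_ip status reply_size
                  method url hierarchy mime referer x_forwarded_for);
  my %rec;
  @rec{@names}     = @fields[0 .. 12];        # the 13 single-token fields
  $rec{user_agent} = $fields[13];             # field 14: User-Agent header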
On file read, country codes are added from a local copy of MaxMind's free GeoLite Country service as follows (geoiplogtag was written by Mark Bergsma; the argument 5 means the 5th field contains the IP address):

  open IN, "-|", "gzip -dc $file_in | /usr/local/bin/geoiplogtag 5";
Output
- Most files are written to sub folder ../yyyy-mm/yyyy-mm-dd
- SquidDataLogfileHeadTail.csv and SquidDataSequenceNumbersAllSquids.csv are reused between runs and updated in sub folder ../yyyy-mm
- Notes
- Files marked as private can contain data which are covered by the Wikimedia privacy policy, such as IP addresses and/or referers. These files should never be published! They should also be deleted after the maximum retention period set by our privacy policy.
- Count is the number of occurrences in one day's sampled log; for actual traffic counts, scale by a factor of 1000
- Country codes are as used by MaxMind's free GeoLite Country service
- There are several reasons why a requester is classified as a bot (see the sketch below):
- the agent string contains a URL, which by convention should only occur for crawlers (exception: Embedded Web Browser from: http://bsalsa.com/)
- the name contains bot/crawler/parser/spider
- for edits only: the IP address occurs more than once in the sampled log on one day (a few false positives are accepted)
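A sketch of these three rules as code (hypothetical helper, not the actual wikistats implementation):

  # hypothetical helper implementing the three bot rules above
  sub is_bot {
      my ($agent, $ip_freq, $is_edit) = @_;
      return 1 if $agent =~ m{https?://}
              and $agent !~ m{bsalsa\.com};                 # URL in agent string => crawler, except bsalsa.com
      return 1 if $agent =~ /bot|crawler|parser|spider/i;   # bot-like name
      return 1 if $is_edit and $ip_freq > 1;                # repeated edits from one IP in one day's sampled log
      return 0;
  }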
- #Ready
- This file signals successful completion of script's phase 2: collecting all counts except ip addresses.
- On rerun phase 2 will not be redone when this file exists, except when argument -f 2 (force) is specified.
- SquidDataAgents.csv
- Free-format agent strings as sent by the browser
- Agent string
- Count
- SquidDataBinaries.csv
-
- Image file name
- Count
- SquidDataClients.csv
-
- Category:
- '-' = Desktop client
- 'M' = Mobile client
- 'E' = Engine (if Gecko or AppleWebKit)
- 'G,-' = Group (higher aggregation level of desktop clients)
- 'G,M' = Group (higher aggregation level of mobile clients)
- Browser (client) brand and version
- Share of total within category
- Category
- SquidDataClientsByWiki.csv
-
- Category:
- '-' = Desktop client
- 'M' = Mobile client
- Client (~ browser) brand and version
- Project (wikipedia, etc)
- Language code
- Share of total within category
- Category
- SquidDataCountriesViews.csv
-
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en')
- Country code (as used by MaxMind)
- Count
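For illustration, a hypothetical reader for this file. It assumes comma-separated columns in the order listed above; remember that sampled counts scale by 1000 (see notes).

  # hypothetical reader: per-country human page views, scaled to real traffic
  my %views;
  open my $in, '<', 'SquidDataCountriesViews.csv' or die $!;
  while (<$in>) {
      chomp;
      my ($bot, $wiki, $country, $count) = split /,/;
      next if $bot eq 'Y';                    # skip requests flagged as bot traffic
      $views{$country} += $count * 1000;      # 1:1000 sampled log, so scale by 1000
  }
  close $in;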
- SquidDataCountriesSaves.csv
-
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en')
- Country code (as used by MaxMind)
- Count
- SquidDataCountriesViewsTimed.csv
-
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en')
- Country code
- Count
- SquidDataCrawlers.csv
- SquidDataEditsSaves[..etc..].txt.bz2 (private file)
- Captures all records describing edit/submit events for later analysis (will become obsolete when 1:1 unsampled capture is reactivated)
- SquidDataExtensions.csv
-
- File extension
- Count
- SquidDataGoogleBots.csv
- Visits by googlebot from known Google IP addresses (note: the list needs a manual update every now and then!)
- Date
- IP range (first 3 octets)
- Count
- SquidDataImages.csv
-
- File size range (in buckets of 20 bytes)
- Count
- SquidDataIndexPhp.csv
- Counts for edits and submits
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en') target wiki
- Project/Language (e.g. Wikipedia English = 'wp:en') referring wiki
- Squid status code e.g. 'TCP_MISS/302'
- Mime type e.g. 'text/html'
- Arguments (with values for some parameters blocked out) e.g. 'action=edit&title=..'
- Count
- SquidDataIpFrequencies[..etc..].csv (private file)
- Generated in pass/phase 1; used among other things to flag multiple edits from the same IP address on one day as bot edits (input is the 1:1000 sampled file; a few false positives are accepted). Also contains some distribution stats: x addresses occur y times, x addresses occur y+ times.
- Count
- IP address
- SquidDataLanguages.csv
- For Chrome, Firefox, Opera, Safari
- Browser name
- Language code
- Count
- SquidDataLogFilesHeadTail.csv
- Caches timestamps found in the first and last record of sampled log files /a/squid/archive/sampled-1000.log-yyyymmdd.gz
- Collecting these from a compressed log file is slow
- SquidDataMethods.csv
-
- Action e.g. GET
- Status e.g. TCP_MISS/200
- Count
- SquidDataOpSys.csv
-
- Category:
- '-' = Desktop client
- 'M' = Mobile client
- 'G' = Group (higher aggregation level)
- OS name + version
- Count
- Share of total ('-' and 'M' together 100%, 'G' separately)
- Category
- SquidDataOrigins.csv
-
- Source (internal or external)
- Domain top level (external e.g. 'yahoo', internal e.g. 'wp:en')
- Domain e.g. '.com'
- File type: image, (html) page, other (e.g. css)
- SquidDataReferers[..etc..].txt (private file)
- mainly for debugging (see comments in file)
- SquidDataRequests.csv
- (see also comments in file)
- Project e.g. '@wp:en' for Mobile English Wikipedia
- Referer e.g. 'www.google.ca'
- Extension e.g. 'gif'
- Mime type e.g. 'text/html'
- Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping; see the sketch after this entry), e.g. 'query=..&search=..&x=..&y=..'
- Count
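A minimal sketch of that normalization (hypothetical helper; the actual script may handle more edge cases):

  # hypothetical helper: sort argument names alphabetically and blank out their values
  sub normalize_args {
      my ($query) = @_;
      my @names = sort map { (split /=/, $_, 2)[0] } split /&/, $query;
      return join '&', map { "$_=.." } @names;
  }
  # e.g. normalize_args("search=foo&x=1&query=bar&y=2") returns 'query=..&search=..&x=..&y=..'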
- SquidDataScripts.csv
-
- Type: css, js or php
- Script name e.g. 'print.css'
- Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
- Count
- SquidDataSequenceNumbersAllSquids.csv
- Ideally, if no UDP messages are lost, average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
- This file can detect mass message loss and help to autocorrect counts
- Date
- Hour
- Events
- Avg distance between sequence numbers
- SquidDataSequenceNumbersPerSquidHour.csv
- Ideally, if no UDP messages are lost, average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
- Squid id e.g. 'amssq31.esams.wikimedia.org'
- Hour
- Events
- Total distance between sequence numbers
- Avg distance between sequence numbers
- SquidDataSkins.csv
-
- Skin file path e.g. 'skins-1.5/common/ajax.js'
- Count
SquidCountryScan.pl
Collects per-country page view stats for all months
- Views:
- starting in July 2009 for page views
- input in stat1002:/a/wikistats_git/squids/csv
- Edits:
- starting in November 2011 for page edits
- input in stat1002:/a/wikistats_git/squids/csv_edits
- Output
- input for SquidReportArchive.pl, to produce a breakdown of Wikipedia views/edits by geo info
- SquidDataVisitsPerCountryMonthly.csv
- SquidDataVisitsDaily.csv
- SquidDataVisitsPerCountryPerWikiDaily.csv
- SquidDataVisitsPerCountryPerProjectDaily.csv
- SquidDataVisitsPerCountryDailyDetailed.csv
- SquidDataCountriesViews.csv
- SquidDataCountries2.csv
SquidReportArchive.pl
Arguments
Input
Csv files generated by SquidCountArchive.pl.
Output
Once a month generates a lot of reports from the csv files generated by SquidCountArchive.pl.
Page view reports are based on a 1:1000 sampled server log (squids) ⇒ all counts × 1000.
Page edit reports are identical to page view reports, but are based on the 1:1 unsampled squid log!
Reports come in two sets
- Breakdown of traffic by one aspect, see this portal section.
- How many files are requested each day? Breakdown by file type and target (sub)project
- Where do those requests originate? Breakdown by file category and origin
- Where do those request land? Breakdown by file category and destination wiki
- Which HTTP requests are issued? Breakdown by type and results
- Which scripts are invoked? Breakdown by type (css, javascript, php), name and parameters
- Which skin files are downloaded and how often?
- Which crawlers access our servers? Breakdown by host, file types requested, agent string
- Which operating systems do our clients use? Breakdown by platform, mobile or not, release
- Which browsers are how popular? Breakdown by brand, revision level, mobile or not
- Breakdown of all traffic that Google sends us, directly (crawlers) and indirectly (search results)
- Breakdown of Wikipedia views by geo info, see this sitemap.
- Breakdown of Wikipedia edits by geo info, see this sitemap.
SquidCollectBrowserStatsExcel.pl
Quick script to collect browser stats for an Excel chart; see example Excel chart
SquidLoadScan.pl
Reads all files on the squid log aggregator with hourly counts for
- number of events received per squid
- average gap in sequence numbers (this should be 1000 ideally on a 1:1000 sampled log)
Writes several aggregations of these data:
- Hourly average delta sequence numbers per squid (should be 1000 in 1:1000 sampled data set, if no UDP loss occurred)
- Monthly averages of events per squid per hour
- Monthly metric per squid set (set defined by location and server role)
This leads to the following report: udp2log message loss, and traffic volume, based on 1:1000 sampled squid logs
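The loss estimate behind that report can be derived from the average gap. A sketch under the assumption of uniform message loss (the actual correction logic may differ): with no loss the average gap is 1000; uniform loss stretches it to 1000 / (1 - loss).

  # hypothetical loss estimate from the average sequence-number gap in a 1:1000 sampled log
  my ($total_distance, $events) = @ARGV;     # as collected per squid per hour
  my $avg_gap = $total_distance / $events;
  my $loss    = $avg_gap > 1000 ? 1 - 1000 / $avg_gap : 0;
  printf "average gap %.0f => estimated udp2log loss %.1f%%\n", $avg_gap, 100 * $loss;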