Analytics/Wikistats/TrafficReports
Migrate / Replace
For a discussion on the future of these reports, see Analytics/Wikistats/TrafficReports/Future
Monthly wikistats page view trend reports, indirectly based on squid logs
- Platform: wikistats server stat1002
- Update frequency: daily
- Input: hourly per-wiki page view totals (files named projectcounts-yyyymmdd-hhnnss), produced by webstatscollector and downloadable from the dump server
- Output:
- csv file for the monthly Report Card
- daily updated monthly page view stats. These reports are available in 4 variations: for desktop and mobile traffic, normalized (all months recalculated to 30 days) or original. There are reports for every project and a report for all projects combined. See sitemap.
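As an illustration, here is a minimal Perl sketch of rolling such hourly files up into monthly per-project totals. It assumes the four-field line layout of webstatscollector output (project code, a '-' placeholder, request count, byte count) and a hypothetical example month; verify against actual files before relying on it.

  # hypothetical roll-up of hourly projectcounts files into monthly per-project view totals
  use strict;
  use warnings;

  my %monthly;
  for my $file (glob "projectcounts-201404*") {      # one file per hour, example month 2014-04
      open my $in, '<', $file or die "$file: $!";
      while (<$in>) {
          my ($project, undef, $views) = split ' ';  # assumed layout, e.g. "en - 12345678 987654321"
          $monthly{$project} += $views;
      }
      close $in;
  }
  print "$_,$monthly{$_}\n" for sort keys %monthly;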
Monthly wikistats traffic reports (geo & traffic type), based on (un)sampled squid logs
- Update frequency: monthly (manual run)
- Scripts location: github; dev/beta on squid1002:/home/ezachte/wikistats/squids
- Input: stat1002:/a/squid/archive/sampled
SquidCountArchive.pl
Collects a large variety of data from /a/squid/archive/sampled in two passes:
- updates a few monthly files (in folder /a/wikistats_git/squids/csv/yyyy-mm), and
- creates a large set of daily csv files in folder ../yyyy-mm/yyyy-mm-dd, with subfolders ../private and ../public.
Modules
- EzLib.pm (now at EZ ~/lib)
- ../SquidCountArchiveProcessLogRecord.pm
- ../SquidCountArchiveReadInput.pm
- ../SquidCountArchiveWriteOutput.pm
Phases
- Pass/phase 1: collect frequencies for all IP addresses; these are needed in phase 2 to filter addresses which are most likely from bots (a frequency > 1 in the sampled squid log stands for over 1000 real views/edits; especially for edits this most likely means a bot; a few false positives are accepted)
- Pass/phase 2: collect all other counts (see the sketch below)
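A minimal sketch of the pass 1 bookkeeping (hypothetical code, not the actual wikistats implementation; the field position follows the log format listed under Input below):

  # pass 1 (hypothetical sketch): tally how often each client IP occurs in one day's sampled log
  use strict;
  use warnings;

  my $log_file = shift;                       # path to one day's gzipped sampled-1000 log
  my %ip_freq;
  open my $in, '-|', "gzip -dc $log_file" or die $!;
  while (<$in>) {
      my $ip = (split ' ')[4];                # field 5 of the log record is the client IP
      $ip_freq{$ip}++;
  }
  close $in;
  # pass 2 rereads the log and collects all other counts,
  # treating edit requests whose IP has $ip_freq{$ip} > 1 as likely bot activity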
Arguments
- -t for test mode (run counts for short timespan)
- -d date range, can be absolute or relative, specify one value for a single day, two values for a range of days
- -d yyyy/mm/dd[-yyyy/mm/dd] (slashes are optional). This is best for generating counts for a specific day or period.
- -d -m[-n], where m and n are numbers of days before today. This is best for daily cron scheduling (e.g. -d 1-7 => run for the last 7 days before today, skipping days for which output already exists).
- -f [1|2|12|21], forces a rerun of phase 1 and/or 2, even when that phase completed successfully earlier
- phase 1 counts as completed when SquidDataIpFrequencies.csv.bz2 exists in ../yyyy-mm/yyyy-mm-dd
- phase 2 counts as completed when #Ready exists in ../yyyy-mm/yyyy-mm-dd
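For example (hypothetical command lines based on the flags above; the actual cron entry may differ):

  perl SquidCountArchive.pl -d 1-7                          # daily cron style: last 7 days, skip finished days
  perl SquidCountArchive.pl -d 2014/04/01-2014/04/30 -f 12  # force full rerun of both phases for April 2014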
Input
- 1:1000 sampled squid log files from stat1002:/srv/squid/archive/sampled
- Format
- see also wikitech
- 1. Hostname
- 2. Sequence number
- 3. Current time in ISO 8601 format (plus milliseconds), according to the squid server's clock
- 4. Request time in ms
- 5. Client IP
- 6. Squid request status, HTTP status code
- 7. Reply size including HTTP headers
- 8. Request method (GET/POST etc)
- 9. URL
- 10. Squid hierarchy status, peer IP
- 11. MIME content type
- 12. Referer header
- 13. X-Forwarded-For header
- 14. User-Agent header
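A hypothetical sketch of splitting one such record into named fields. It assumes space-separated fields with the User-Agent, which may itself contain spaces, as the last field; the real parser may differ.

  # hypothetical sketch: split one sampled squid log record into the 14 fields listed above
  my $line   = <STDIN>;                       # one log record
  my @fields = split ' ', $line, 14;          # limit 14 so spaces inside the User-Agent survive
  my @names  = qw(hostname seqno timestamp request_time client_ip status reply_size
                  method url hierarchy mime referer x_forwarded_for);
  my %rec;
  @rec{@names}     = @fields[0 .. 12];        # the 13 single-token fields
  $rec{user_agent} = $fields[13];             # field 14: User-Agent header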
On file read, country codes are added from a local copy of MaxMind's free GeoLite Country service as follows (geoiplogtag was written by Mark Bergsma; the argument 5 means the 5th field contains the IP address):

  open IN, "-|", "gzip -dc $file_in | /usr/local/bin/geoiplogtag 5";
Output
- Most files are written to sub folder ../yyyy-mm/yyyy-mm-dd
- SquidDataLogfileHeadTail.csv and SquidDataSequenceNumbersAllSquids.csv are reused between runs and updated in sub folder ../yyyy-mm
- Notes
- Files marked as private can contain data which are covered by the Wikimedia privacy policy, such as IP addresses and/or referers. These files should never be published! They should also be deleted after the maximum retention period set by our privacy policy.
- Count is the number of occurrences in one day's sampled log; for actual traffic counts, scale by a factor of 1000
- Country codes are as used by MaxMind's free GeoLite Country service
- There are several reasons why a requester is classified as a bot (see the sketch below):
- the agent string contains a URL, which by convention should only occur for crawlers (exception: Embedded Web Browser from: http://bsalsa.com/)
- the name contains bot/crawler/parser/spider
- for edits only: the IP address occurs more than once in the sampled log on one day (a few false positives are accepted)
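A sketch of these three rules as code (hypothetical helper, not the actual wikistats implementation):

  # hypothetical helper implementing the three bot rules above
  sub is_bot {
      my ($agent, $ip_freq, $is_edit) = @_;
      return 1 if $agent =~ m{https?://}
              and $agent !~ m{bsalsa\.com};                 # URL in agent string => crawler, except bsalsa.com
      return 1 if $agent =~ /bot|crawler|parser|spider/i;   # bot-like name
      return 1 if $is_edit and $ip_freq > 1;                # repeated edits from one IP in one day's sampled log
      return 0;
  }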
- #Ready
- This file signals successful completion of script's phase 2: collecting all counts except ip addresses.
- On rerun phase 2 will not be redone when this file exists, except when argument -f 2 (force) is specified.
- SquidDataAgents.csv
- Free-format agent strings as sent by the browser
- Agent string
- Count
- SquidDataBinaries.csv
-
- Image file name
- Count
- SquidDataClients.csv
-
- Category:
- '-' = Desktop client
- 'M' = Mobile client
- 'E' = Engine (if Gecko or AppleWebKit)
- 'G,-' = Group (higher aggregation level of desktop clients)
- 'G,M' = Group (higher aggregation level of mobile clients)
- Browser (client) brand and version
- Share of total within category
- Category
- SquidDataClientsByWiki.csv
-
- Category:
- '-' = Desktop client
- 'M' = Mobile client
- Client (~ browser) brand and version
- Project (wikipedia, etc)
- Language code
- Share of total within category
- Category
- SquidDataCountriesViews.csv
-
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en')
- Country code (as used by MaxMind)
- Count
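For illustration, a hypothetical reader for this file. It assumes comma-separated columns in the order listed above; remember that sampled counts scale by 1000 (see notes).

  # hypothetical reader: per-country human page views, scaled to real traffic
  my %views;
  open my $in, '<', 'SquidDataCountriesViews.csv' or die $!;
  while (<$in>) {
      chomp;
      my ($bot, $wiki, $country, $count) = split /,/;
      next if $bot eq 'Y';                    # skip requests flagged as bot traffic
      $views{$country} += $count * 1000;      # 1:1000 sampled log, so scale by 1000
  }
  close $in;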
- SquidDataCountriesSaves.csv
-
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en')
- Country code (as used by MaxMind)
- Count
- SquidDataCountriesViewsTimed.csv
-
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en')
- Country code
- Count
- SquidDataCrawlers.csv
- SquidDataEditsSaves[..etc..].txt.bz2 (private file)
- Captures all records describing edit/submit events for later analysis (will become obsolete when 1:1 unsampled capture is reactivated)
- SquidDataExtensions.csv
-
- File extension
- Count
- SquidDataGoogleBots.csv
- Visits by googlebot from known Google IP addresses (note: the list needs a manual update every now and then!)
- Date
- IP range (first 3 octets)
- Count
- SquidDataImages.csv
-
- File size range (in buckets of 20 bytes)
- Count
- SquidDataIndexPhp.csv
- Counts for edits and submits
- Bot (Y/N) (see notes)
- Project/Language (e.g. Wikipedia English = 'wp:en') target wiki
- Project/Language (e.g. Wikipedia English = 'wp:en') referring wiki
- Squid status code e.g. 'TCP_MISS/302'
- Mime type e.g. 'text/html'
- Arguments (with values for some parameters blocked out) e.g. 'action=edit&title=..'
- Count
- SquidDataIpFrequencies[..etc..].csv (private file)
- Generated in pass/phase 1; used among other things to flag multiple edits from the same IP address on one day as bot edits (input is the 1:1000 sampled file; a few false positives are accepted). Also contains some distribution stats: x addresses occur y times, x addresses occur y+ times.
- Count
- IP address
- SquidDataLanguages.csv
- For Chrome, Firefox, Opera, Safari
- Browser name
- Language code
- Count
- SquidDataLogFilesHeadTail.csv
- Caches timestamps found in the first and last record of sampled log files /a/squid/archive/sampled-1000.log-yyyymmdd.gz
- Collecting these from a compressed log file is slow
- SquidDataMethods.csv
-
- Action e.g. GET
- Status e.g. TCP_MISS/200
- Count
- SquidDataOpSys.csv
-
- Category:
- '-' = Desktop client
- 'M' = Mobile client
- 'G' = Group (higher aggregation level)
- OS name + version
- Count
- Share of total ('-' and 'M' together 100%, 'G' separately)
- Category
- SquidDataOrigins.csv
-
- Source (internal or external)
- Domain top level (external e.g. 'yahoo', internal e.g. 'wp:en')
- Domain e.g. '.com'
- File type: image, (html) page, other (e.g. css)
- SquidDataReferers[..etc..].txt (private file)
- mainly for debugging (see comments in file)
- SquidDataRequests.csv
- (see also comments in file)
- Project e.g. '@wp:en' for Mobile English Wikipedia
- Referer e.g. 'www.google.ca'
- Extension e.g. 'gif'
- Mime type e.g. 'text/html'
- Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping; see the sketch after this entry), e.g. 'query=..&search=..&x=..&y=..'
- Count
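A minimal sketch of that normalization (hypothetical helper; the actual script may handle more edge cases):

  # hypothetical helper: sort argument names alphabetically and blank out their values
  sub normalize_args {
      my ($query) = @_;
      my @names = sort map { (split /=/, $_, 2)[0] } split /&/, $query;
      return join '&', map { "$_=.." } @names;
  }
  # e.g. normalize_args("search=foo&x=1&query=bar&y=2") returns 'query=..&search=..&x=..&y=..'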
- SquidDataScripts.csv
-
- Type: css, js or php
- Script name e.g. 'print.css'
- Arguments (values for most arguments omitted, arguments sorted alphabetically for better grouping), e.g. 'query=..&search=..&x=..&y=..'
- Count
- SquidDataSequenceNumbersAllSquids.csv
- Ideally, if no UDP messages are lost, average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
- This file can detect mass message loss and help to autocorrect counts
- Date
- Hour
- Events
- Avg distance between sequence numbers
- SquidDataSequenceNumbersPerSquidHour.csv
- Ideally, if no UDP messages are lost, average distance between squid sequence numbers should be 1000 (1:1000 sampled log)
- Squid id e.g. 'amssq31.esams.wikimedia.org'
- Hour
- Events
- Total distance between sequence numbers
- Avg distance between sequence numbers
- SquidDataSkins.csv
-
- Skin file path e.g. 'skins-1.5/common/ajax.js'
- Count
SquidCountryScan.pl
Collects per-country page view stats for all months
- Views:
- starting in July 2009 for page views
- input in stat1002:/a/wikistats_git/squids/csv
- Edits:
- starting in November 2011 for page edits
- input in stat1002:/a/wikistats_git/squids/csv_edits
- Output
- input for SquidReportArchive.pl, to produce a breakdown of Wikipedia views/edits by geo info
- SquidDataVisitsPerCountryMonthly.csv
- SquidDataVisitsDaily.csv
- SquidDataVisitsPerCountryPerWikiDaily.csv
- SquidDataVisitsPerCountryPerProjectDaily.csv
- SquidDataVisitsPerCountryDailyDetailed.csv
- SquidDataCountriesViews.csv
- SquidDataCountries2.csv
SquidReportArchive.pl
Arguments
Input
Csv files generated by SquidCountArchive.pl.
Output
Once a month generates a lot of reports from the csv files generated by SquidCountArchive.pl.
Page view reports are based on a 1:1000 sampled server log (squids) ⇒ all counts × 1000.
Page edit reports are identical to page view reports, but are based on the 1:1 unsampled squid log!
Reports come in two sets
- Breakdown of traffic by one aspect, see this portal section.
- How many files are requested each day? Breakdown by file type and target (sub)project
- Where do those requests originate? Breakdown by file category and origin
- Where do those request land? Breakdown by file category and destination wiki
- Which HTTP requests are issued? Breakdown by type and results
- Which scripts are invoked? Breakdown by type (css, javascript, php), name and parameters
- Which skin files are downloaded and how often?
- Which crawlers access our servers? Breakdown by host, file types requested, agent string
- Which operating systems do our clients use? Breakdown by platform, mobile or not, release
- Which browsers are how popular? Breakdown by brand, revision level, mobile or not
- Breakdown of all traffic that Google sends us, directly (crawlers) and indirectly (search results)
- Breakdown of Wikipedia views by geo info, see this sitemap.
- Breakdown of Wikipedia edits by geo info, see this sitemap.
SquidCollectBrowserStatsExcel.pl
Quick script to collect browser stats for an Excel chart; see example Excel chart
SquidLoadScan.pl
Reads all files on the squid log aggregator with hourly counts for
- number of events received per squid
- average gap in sequence numbers (this should be 1000 ideally on a 1:1000 sampled log)
Writes several aggregations of these data:
- Hourly average delta sequence numbers per squid (should be 1000 in 1:1000 sampled data set, if no UDP loss occurred)
- Monthly averages of events per squid per hour
- Monthly metric per squid set (set defined by location and server role)
This leads to the following report: udp2log message loss, and traffic volume, based on 1:1000 sampled squid logs
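The loss estimate behind that report can be derived from the average gap. A sketch under the assumption of uniform message loss (the actual correction logic may differ): with no loss the average gap is 1000; uniform loss stretches it to 1000 / (1 - loss).

  # hypothetical loss estimate from the average sequence-number gap in a 1:1000 sampled log
  my ($total_distance, $events) = @ARGV;     # as collected per squid per hour
  my $avg_gap = $total_distance / $events;
  my $loss    = $avg_gap > 1000 ? 1 - 1000 / $avg_gap : 0;
  printf "average gap %.0f => estimated udp2log loss %.1f%%\n", $avg_gap, 100 * $loss;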