Analytics/Archive/Pixel Service
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up to date.
This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics
The Pixel service is the "front door" to the analytics system: a public endpoint with a simple interface for getting data into the datastore.
Components
- Request Endpoint: HTTP server that handles GET requests to the pixel service endpoint, responding with 204 NO CONTENT or an actual, honest-to-god 1x1 transparent gif. Data is submitted into the cluster via query parameters (see the sketch after this list).
- Messaging System: routes messages (request information + data content) from the request endpoint to the datastore. This component is intended to be implemented by Apache Kafka.
- Datastore Consumer: consumes messages, shunting them into the datastore utilizing HDFS staging and/or append.
- Processing Toolkit: a standard template for a Pig job to process (count, aggregate, etc.) event data from query-string params, handling standard indirection for referrer and timestamp and Apache Avro de/serialization, and providing tools for conversion-funnel and A/B-testing analysis.
- Event Logging Library: a JS library with an easy interface to abstract the sending of data to the service. Handles event-data conventions for proxied timestamp, referrer, and the other normal web-request components.
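For illustration, here is a minimal sketch of the request endpoint's behavior, assuming a plain Python HTTP server (the actual implementation is not specified on this page): it accepts GET /event.gif, pulls the event out of the query string, hands it to a placeholder standing in for the messaging system, and answers 204 NO CONTENT.

```python
# Minimal sketch of the request endpoint (illustration only, not the real
# implementation): parse the query string of GET /event.gif, hand the event
# to a placeholder in lieu of the messaging system, answer 204 NO CONTENT.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != '/event.gif':
            self.send_error(404)
            return
        # Event data arrives purely as query parameters,
        # e.g. /event.gif?product=mobile&action=view
        event = {k: v[0] for k, v in parse_qs(url.query).items()}
        self.publish(event)
        # 204 NO CONTENT; serving an actual 1x1 transparent gif is the
        # alternative mentioned above.
        self.send_response(204)
        self.end_headers()

    def publish(self, event):
        # Placeholder for the messaging-system (Kafka) producer.
        print(event)


if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8000), PixelHandler).serve_forever()
```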
Service prototype
To get up and running right away, we're going to start with an alpha prototype, and work with teams to see where it goes.
- /event.gif on bits multicast stream -> udp2log (1:1) running in Analytics cluster
  - Until bits caches are ready, we'll also have a publicly accessible endpoint on analytics1001
- Kafka consumes udp2log, creating topic per product-code -- no intermediate aggregation at cache DC
- Cron to run Kafka-Hadoop consumer, importing all topics into Hadoop to datetime+producer-code paths
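As a rough sketch of that last step -- not the actual Kafka-Hadoop consumer the cron job would invoke -- the following assumes the kafka-python client plus hypothetical broker and product-code topic names, and appends each topic's messages under paths keyed by producer code and datetime to mirror the intended layout:

```python
# Illustrative sketch only (not the actual Kafka-Hadoop consumer run from
# cron): consume per-product-code topics and append messages under paths
# keyed by producer code and datetime, mirroring the intended Hadoop layout.
# Broker address and topic names are hypothetical.
import os
from datetime import datetime, timezone

from kafka import KafkaConsumer  # assumes the kafka-python client

consumer = KafkaConsumer(
    'mobile', 'zero',                        # hypothetical product-code topics
    bootstrap_servers='analytics1001:9092',  # hypothetical broker
    group_id='pixel-import',
    auto_offset_reset='earliest',
)

for msg in consumer:
    hour = datetime.now(timezone.utc).strftime('%Y/%m/%d/%H')
    # e.g. /data/events/mobile/2013/01/12/00
    path = os.path.join('/data/events', msg.topic, hour)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, 'part-0'), 'ab') as out:
        out.write(msg.value + b'\n')
```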
EventLogging Integration TODOs
- Make sure all event data goes into Kraken (I think it may only be esams at the moment, not sure). [ottomata] (Dec)
- Divvy up some TODOs with Ori:
- Keeping udp2log seq id counters for each bits host and emitting some alert if gaps are detected (see the sketch after this list)
- Until https://rt.wikimedia.org/Ticket/Display.html?id=4094 is resolved, monitor for truncated URIs (detectable because missing trailing ';') and set up some alerting scheme
- Speaking of that RT ticket: check w/Mark if we can do something useful to move that along (like update the patch so it applies against the versions deployed to prod).
- Figure out a useful arrangement for server-side events (basic idea: call wfDebugLog(..) on hooks that represent "business" events, have wfDebugLog write to a UDP / TCP socket pointing at Kraken). See the EventLogging extension for some idea of what I mean.
- already done? EventLogging's efLogServerSideEvent() validates events against a versioned schema on meta-wiki and writes them using wfDebugLog (currently to UDP). E3 logs all AccountCreation events on all servers using this. -- S Page (WMF) (talk) 00:39, 12 January 2013 (UTC)
- Things Ori needs and would repay in dev time and/or sexual favors:
  - Puppetization of stuff on Vanadium
  - Help w/MySQL admin
- Other EventLogging TODOs: mw:Extension:EventLogging/Todos
- Figure out how to map event schemas to Avro(?) or some other way to make Hadoop schema-aware so the data is actually useful rather than just blob-like
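A hedged sketch of the sequence-gap and truncated-line monitoring from the items above, assuming the bits host is the first field and the sequence id the second field of each udp2log line (adjust the indices to the real format), with stderr prints standing in for a real alerting scheme:

```python
# Sketch of the monitoring above: track the udp2log sequence id per bits host
# and flag gaps, plus the truncated-line check (intact lines are assumed to
# end with ';'). Field positions are assumptions -- adjust them to the real
# log format. Printing to stderr stands in for an actual alerting scheme.
import sys

last_seq = {}  # host -> last sequence id seen

for raw in sys.stdin:
    line = raw.rstrip('\n')
    fields = line.split()
    if len(fields) < 2 or not fields[1].isdigit():
        continue
    host, seq = fields[0], int(fields[1])
    prev = last_seq.get(host)
    if prev is not None and seq != prev + 1:
        print(f'ALERT: {host} jumped from seq {prev} to {seq} '
              f'({seq - prev - 1} lines missing)', file=sys.stderr)
    if not line.endswith(';'):
        print(f'ALERT: possibly truncated line from {host}', file=sys.stderr)
    last_seq[host] = seq
```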
Getting to production
We're pretty settled on Kafka as the messaging transport, but to use the dynamic load-balancing and failover features we need a ZooKeeper-aware producer; unfortunately, only the Java and C# clients have this functionality. (This is a blocker for both the Pixel Service AND general request logging.)
Three options:
- Pipe logging output from Squid & Varnish into the console producer (which implies running the JVM in production);
- Write code (a Varnish plugin plus configuration as described here, as well as a Squid module, both in something C-like) to do ZK integration and publish to Kafka;
- Continue to use udp2log -> Kafka with the caveat that the stream is unreliable until it gets to Kafka.
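As a concrete illustration of option 3, here is a minimal relay that listens for udp2log output on UDP and republishes each line to Kafka. The port, broker address, topic name, and the kafka-python client are all assumptions made for the sketch; the stream remains unreliable (plain UDP) until the producer hands it to Kafka.

```python
# Rough sketch of option 3: listen on the udp2log port and republish each
# line to Kafka. The port, broker address, and topic are placeholders, and
# the kafka-python client is an assumption, not a production choice; the
# stream stays plain, unreliable UDP until it reaches the producer.
import socket

from kafka import KafkaProducer  # assumes the kafka-python client

UDP_PORT = 8420  # hypothetical udp2log relay port
producer = KafkaProducer(bootstrap_servers='analytics1001:9092')

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', UDP_PORT))

while True:
    datagram, _addr = sock.recvfrom(65535)
    for line in datagram.splitlines():
        # One flat topic here; the real setup would pick a topic per
        # product code and rely on ZooKeeper-aware balancing.
        producer.send('webrequest', value=line)
```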
Frequently Asked Questions
What HTTP actions will the service support?
GET.
What about POSTs?
No POST. Only GET. Other than content-length, there's no real justification for a POST, and if you're sending strings that are greater than 2k, you kind of already have a problem.
Can I send JSON?
Sure, but we're probably not going to do anything special with it -- the JSON values will show up as strings that you'll have to parse to aggregate, count, etc. Ex: GET /event.gif?json={"foo":1,"bar":[1,2,3]} (and recall you'll have to encodeURIComponent(json)).
As we want to build tools to cover the normal cases first, this is not really recommended. (Just use x-www-form-urlencoded KV pairs as usual.) If anyone has a REEEEALLY good use-case, we can talk about having a key-convention for sending a JSON payload, like, say, calling the key json.
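Here is what that convention looks like end to end, sketched in Python with urllib standing in for the browser's encodeURIComponent; on the datastore side the value is still just a string until you parse it yourself.

```python
# The JSON-as-a-value convention end to end: serialize the payload, then
# percent-encode it into the query string (urlencode here plays the role of
# encodeURIComponent in the browser). On the datastore side the value is an
# opaque string until you json.loads() it yourself.
import json
from urllib.parse import urlencode, parse_qs

payload = {"foo": 1, "bar": [1, 2, 3]}
query = urlencode({"json": json.dumps(payload, separators=(',', ':'))})
url = "http://bits.wikimedia.org/event.gif?" + query
# -> http://bits.wikimedia.org/event.gif?json=%7B%22foo%22%3A1%2C%22bar%22%3A%5B1%2C2%2C3%5D%7D

# What a downstream job sees: a string value that still needs parsing.
decoded = json.loads(parse_qs(query)["json"][0])
assert decoded == payload
```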
If I send crazy HTTP headers, will the service record them?
No. We will not parse anything other than the query string.
Custom headers are exactly what we want to avoid -- think of the metadata in an HTTP request as being an interface. You want it to be minimal and well-defined, so little custom parsing needs to occur. KV-pairs in the query string are both flexible and generic enough to meet all reasonable use-cases. If you really need typing, send JSON as the value (as mentioned above).
See also
- Extension:EventLogging from the E3 team uses a similar approach to the Kraken Pixel Service: client-side JavaScript makes GET requests to bits.wikimedia.org/event.gif?param1=value1.... As of December 2012 mw:Onboarding new Wikipedians and Community portal redesign use this extension and its JSON schema-driven logging, and Extension:MobileFrontend makes requests directly to the event.gif endpoint. Meta-wiki hosts the schemas defining the events they log.
- Extension:ClickTracking from 2010 implements (among other features) event logging via HTTP requests to a MediaWiki API that writes to a ClickTracking "log" which we route over UDP. As of December 2012 several extensions still depend on ClickTracking, but few actually generate log events.