User:BDavis (WMF)/Projects/Structured logging
This page is currently a draft.
This is a collection of ideas about structured logging in MediaWiki and related software projects. For the purposes of this discussion "logging" refers to wfErrorLog() and its related family of functions, not the audit logs stored in the database by Special:Log or similar systems.
Problem
Logging in MediaWiki (and at WMF in general?) is optimized for human consumption. This works well for local development and testing, and it can scale to managing a small to mid-size wiki, depending on the number of eyes applied to the logs and the acumen of the operational support personnel with grep, cut, sed, awk and other text processing tools.
Unfortunately, what works well for a single developer doesn't work as well for analyzing the log output of a large production site.
- Need better tools for a wider audience:
  - aggregation
  - de-duplication
  - cross-system correlation
  - alerting
  - reporting
This is not a wholly new idea. Let's look at what's out there and see if we can find a solution or at least borrow the best bits.
Current logging
- wfDebug( $text, $logonly = false )
  - Logs developer-provided free-form text plus an optional global prefix string
  - May have time elapsed since the start of the request and real memory usage inserted between the prefix and the message
  - Delegates to wfErrorLog()
- wfDebugMem( $exact = false )
  - Uses wfDebug() to log "Memory usage: N (kilo)?bytes"
- wfDebugLog( $logGroup, $text, $public = true )
  - Logs either to a custom log sink defined in $wgDebugLogGroups or via wfDebug()
  - Default
    - Prepends "[$logGroup] " to the message
  - Custom sink
    - May log only a fraction of occurrences via mt_rand() sampling
    - Prepends "wfTimestamp( TS_DB ) wfWikiID() wfHostname(): " to the message
    - Delegates to wfErrorLog() to actually write to the sink
- wfLogDBError( $text )
  - Enabled/disabled with the $wgDBerrorLog sink location
  - Logs "$date\t$host\t$wiki\t$text" via wfErrorLog() to the $wgDBerrorLog sink
  - Date format is 'D M j G:i:s T Y' with a possible custom timezone specified by $wgDBerrorLogTZ
- wfErrorLog( $text, $file )
  - Writes $text to either a local file or a UDP packet depending on the value of $file
  - UDP
    - If $file ends with a string following the host name/IP it will be used as a prefix to $text
    - The final message, with the optional prefix added, is trimmed to 65507 bytes and a trailing newline may be added
  - FILE
    - $text is appended to the file unless the resulting file size would be >= 0x7fffffff bytes (~2G)
- wfLogProfilingData()
  - Delegates to wfErrorLog() using the $wgDebugLogFile sink
  - Creates a tab-delimited log message including timestamp, elapsed request time, requesting IPs, and request URL, followed by a newline and the profiler output
  - Date is from gmdate( 'YmdHis' )
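For reference, typical call sites with the current functions look roughly like this (a sketch; the log group, messages and sink paths are illustrative, not taken from production configuration):
<syntaxhighlight lang="php">
// Free-form debug text; routed by wfDebug() and ultimately wfErrorLog()
wfDebug( "Session loaded for current user\n" );

// Grouped message; goes to the sink configured in $wgDebugLogGroups['exception'],
// or falls back to wfDebug() when no custom sink is configured
wfDebugLog( 'exception', 'Unexpected state in SomeClass::someMethod' );

// Database error helper; writes to the $wgDBerrorLog sink
wfLogDBError( 'Error connecting to db1001: too many connections' );

// Low-level writer: append to a local file, or send a UDP packet for udp:// sinks
wfErrorLog( "raw log line\n", '/var/log/mediawiki/debug.log' );
wfErrorLog( "raw log line\n", 'udp://127.0.0.1:8420/mywiki' );
</syntaxhighlight>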
Structured logging standards
Graylog Extended Log Format (GELF)
From http://www.graylog2.org/about/gelf:
"The Graylog Extended Log Format (GELF) avoids the shortcomings of classic plain syslog:
- Limited to length of 1024 byte - Not much space for payloads like backtraces
- Unstructured. You can only build a long message string and define priority, severity etc."
Structured log messages are sent via UDP as gzip'd JSON, with chunking support (see the sketch after the field list below).
- client libraries in PHP, Ruby, Java, Python, etc.
- native input for Graylog, plugin for Logstash
- version: GELF spec version ("1.0")
- host: hostname sending the message
- short_message: short descriptive message
- full_message: long message (e.g. backtrace, env vars)
- timestamp: UNIX microsecond timestamp
- level: numeric syslog level
- facility: event generator label
- line: source file line
- file: source file
- _<whatever>: sender-specified field
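A minimal sketch of what a GELF payload could look like when built from PHP; the field values and the _wiki extra field are illustrative, not part of the spec:
<syntaxhighlight lang="php">
// Shape of a GELF 1.0 message as a PHP array (values are illustrative)
$gelf = array(
    'version'       => '1.0',
    'host'          => php_uname( 'n' ),
    'short_message' => 'Database connection failed',
    'full_message'  => 'stack trace or other long detail goes here',
    'timestamp'     => microtime( true ),
    'level'         => 3,                  // syslog "error"
    'facility'      => 'MediaWiki',
    'file'          => __FILE__,
    'line'          => __LINE__,
    '_wiki'         => 'enwiki',           // sender-specified extra field
);

// GELF payloads are compressed JSON sent over UDP (chunked when large)
$payload = gzcompress( json_encode( $gelf ) );
</syntaxhighlight>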
Metlog
From https://wiki.mozilla.org/Services/Sagrada/Metlog:
"The Metlog project is part of Project Sagrada, providing a service for applications to capture and inject arbitrary data into a back end storage suitable for out-of-band analytics and processing."
3-tier architecture: generators (e.g. an app), a router (Logstash), and endpoints (back-end event destinations: statsd, Sentry, HDFS, Esper, OpenTSDB, ...)
- Python and Node.js client libs for formatting and emitting messages
- Logstash plugins for input processing and output to the back ends
- JSON-serialized messages (see the example after the field list below)
- timestamp: time the message was generated
- logger: message generator (e.g. app name)
- type: type of message payload (application defined; correlated with payload and fields)
- severity: RFC 5424 numeric code (syslog level)
- payload: message contents
- fields: arbitrary key/value pairs
- env_version: message envelope version (0.8)
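A Metlog-shaped message sketched as a PHP array (values are illustrative; consult the spec for exact formats such as the timestamp):
<syntaxhighlight lang="php">
// Shape of a Metlog message as a PHP array (values are illustrative)
$metlog = array(
    'timestamp'   => gmdate( 'c' ),     // ISO 8601 here; exact format per the spec
    'logger'      => 'MediaWiki',
    'type'        => 'error',
    'severity'    => 3,                 // RFC 5424 "error"
    'payload'     => 'Database connection failed',
    'fields'      => array( 'wiki' => 'enwiki', 'server' => 'db1001' ),
    'env_version' => '0.8',
);

echo json_encode( $metlog );
</syntaxhighlight>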
Common Event Expression (CEE)
From http://cee.mitre.org/:
"Common Event Expression (CEE™) improves the audit process and the ability of users to effectively interpret and analyze event log and audit data. This is accomplished by defining an extensible unified event structure, which users and developers can leverage to describe, encode, and exchange their CEE Event Records."
- Project dead due to funding loss, but some interesting artifacts left behind.
- Lumberjack (https://fedorahosted.org/lumberjack/) seems to be an equally dead implementation.
- CEE/Lumberjack provide taxonomies for the types of arbitrary key/value pairs described in Metlog and other basic structured log formats.
Other ideas
- Mapped diagnostic context (MDC)
  - I used to have an academic paper describing this; I can't find it now
  - Available in Java, Ruby, and lots of other languages
  - I wrote one for PHP: https://github.com/bd808/moar-log/blob/master/src/Moar/Log/MDC.php
  - A minimal sketch of the idea follows this list
- Timers and counters
  - statsd everywhere
- Separate events from emitters
- "optimistic logging": aggregate log events in RAM, only emit if a "severe" message is seen
Proposal
[edit]Serialization Format
Rather than diving down a rabbit hole of trying to find a universal spec for log file formats, let's keep things simple. PHP loves dictionaries (well, it calls them arrays, but whatever; key=value collections) and has a pretty fast JSON formatter. So the simplest thing that will work reasonably well is to keep log events internally as PHP arrays and serialize them as JSON objects. This will be relatively easy to recreate in other internally developed applications as well, with the possible exception of apps written in low-level languages such as C that don't have ready-made key=value data structures.
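The whole idea in a few lines (field names are placeholders; the real set is proposed below):
<syntaxhighlight lang="php">
// Keep the event as a plain PHP array for as long as possible...
$event = array(
    'timestamp' => date( 'c' ),
    'severity'  => 'WARN',
    'message'   => 'replication lag exceeded threshold',
);

// ...and serialize only at the sink boundary.
$line = json_encode( $event ) . "\n";
</syntaxhighlight>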
Data collected
Here's a list of the data points that I think we should definitely have (a sketch assembling them follows the list):
- timestamp: Local system time that the event occurred, either as a UNIX epoch timestamp or an ISO 8601 formatted string, e.g. date( 'c' )
- host: FQDN of the system where the event occurred, e.g. php_uname( 'n' )
- source: Name of the application generating events; correlates to APP-NAME of RFC 5424, e.g. 'MediaWiki'
- pid: Unix process id, thread id, thread name or other process identifier, e.g. getmypid()
- severity: Message severity (RFC 5424 levels), e.g. 'WARN'
- channel: Log channel, often the function/module/class creating the message (similar to $wgDebugLogGroups groups), e.g. get_class( $this )
- message: Message body, e.g. "Help! I'm trapped in a logger factory!"
Additionally, I would suggest that we add a semi-structured "context" component to logs. This would be a collection of key=value pairs that developers determine to be useful for debugging. In the past I have found it useful to have two different methods available for adding such data: the first is an optional argument to the logging method itself, and the second is a global collection patterned after the Log4j Mapped Diagnostic Context (MDC).
The local collection is useful for obvious reasons, such as attaching class/method state data to the log output and deferring stringification of resources in the event that the runtime configuration is ignoring messages of the provided level. Useful local context fields include:
- file: Source file triggering the message
- line: Source line triggering the message
- errcode: Numeric or string identifier for the error
- exception: Live exception object to be stringified by the log event emitter
- args: key=value map of method arguments
The global collection is very useful for attaching global application state data to all log messages that may be emitted. Examples of data that could be included (a combined sketch of local and global context follows the list):
- vhost: Apache vhost processing the request, e.g. $_SERVER['HTTP_HOST']
- ip: Requesting IP address, e.g. $_SERVER['REMOTE_ADDR']
- user: Authenticated user identity
- req: Request ID; a UUID or similar token that can be used to correlate all log messages connected to a given request
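Putting the two kinds of context together, the data attached to one event might look like this (field values are illustrative; how the context is attached is an open API question):
<syntaxhighlight lang="php">
// Per-call (local) context attached to a single event
$localContext = array(
    'file'    => __FILE__,
    'line'    => __LINE__,
    'errcode' => 'lock-timeout',
    'args'    => array( 'key' => 'enwiki:lock:Main_Page', 'timeout' => 3 ),
);

// Global (MDC-style) context registered once early in the request and merged
// into every event by the emitter
$globalContext = array(
    'vhost' => $_SERVER['HTTP_HOST'],
    'ip'    => $_SERVER['REMOTE_ADDR'],
    'req'   => 'request-correlation-uuid',   // minted at request start
);

$event = array(
    'message' => 'Failed to acquire lock',
    'context' => $localContext + $globalContext,
);
</syntaxhighlight>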
There are obvious privacy concerns with some of these data fields. Careful thought will need to be given to designing a system such that wikis can choose whether or not to log any of these fields.
Not collected (rejected fields)
- ua: User agent
- sess: Application session ID (or a proxy token that can be used to correlate events connected to the session)
API
FIXME: Propose a PHP API for logging and a migration plan for existing wfErrorLog() calls
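Purely as a strawman while the FIXME stands, one possible shape (interface and method names are hypothetical, not a decided API):
<syntaxhighlight lang="php">
// Strawman only: names and signatures here are hypothetical, not a decided API.
interface StructuredLogger {
    /**
     * @param string $channel  Log channel (like a $wgDebugLogGroups group)
     * @param string $severity RFC 5424 level name, e.g. 'WARN'
     * @param string $message  Human-readable message body
     * @param array  $context  Per-call key=value context
     */
    public function log( $channel, $severity, $message, array $context = array() );

    /** Add a key=value pair to the global (MDC-style) context. */
    public function setGlobalContext( $key, $value );
}

// Migration idea: existing helpers could delegate to the new API, e.g.
// wfDebugLog( $group, $text ) => $logger->log( $group, 'DEBUG', $text );
</syntaxhighlight>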
See Also
- http://gregoryszorc.com/blog/2012/12/06/thoughts-on-logging---part-1---structured-logging/
- https://journal.paul.querna.org/articles/2011/12/26/log-for-machines-in-json/
- http://carolina.mff.cuni.cz/~trmac/blog/2011/structured-logging/
- http://dev.splunk.com/view/logging-best-practices/SP-CAAADP6
- https://delicious.com/bd808/logging