This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics

Almost everything in the big data world is Java, and because most of it was written by experienced engineers and has seen real production deployment, the components all publish stats and controls via JMX. Thus a solid JMX monitoring solution would be great.
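To make "publish stats via JMX" concrete, here is a minimal sketch that reads one of the standard JVM MBeans from inside the same process. The class name `HeapProbe` is made up for illustration; `java.lang:type=Memory` and its `HeapMemoryUsage` attribute are part of the standard platform MBeans. Monitoring a remote service would need a JMX connector instead of the in-process server used here.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Hypothetical example class, not part of any tool listed on this page.
public class HeapProbe {
    // Reads the "used" field of the standard java.lang:type=Memory MBean.
    static long usedHeapBytes() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        // Through the MBeanServer, MXBean attributes arrive as open types (CompositeData).
        CompositeData usage = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
        return (Long) usage.get("used");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("heap used (bytes): " + usedHeapBytes());
    }
}
```

Any agent in the lists below ultimately does some variation of this lookup, then ships the number somewhere else.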

Research

Comparison of Network Monitoring Systems -- most of the solutions listed here won't do what we need, but it was my starting point for many of these links.

Backends / Full Stack

These should be full-stack monitoring applications, with client producer libraries, a server for data aggregation, a dashboard for reviewing data/interacting with JMX services, and configurable instrumentation alerts.

Graphite (source, shitty official website): "Graphite only knows two things: (1) how to store data; (2) how to render graphs of this data." JVM integration via jmxtrans (see below). There's also a metric shitton of tools -- from alternate dashboards to bridges to and from other monitoring systems/agents.
Ganglia (source): "Monitoring your clusters and grids since the year 2000." JVM integration via jmxetric, a polling agent that runs via -javaagent. Unfortunately, it requires explicit configuration for all mbeans you want to monitor (yes, including JVM internals). Big-ass XML file: ahoy!
Zabbix: I've used this before. It wasn't fantastic, pretty, or easy to use, but it did most of what it said. I found the source to be totally unhackable, as it's an ugly combination of C, PHP, and Java.
Ooyala's Hastur (client): I attended a talk on Hastur's architecture at the Cassandra Summit (video, slides), and I've wanted to play with it since then. If you're curious about the arch, check out those links. Unfortunately, while it provides both a client and a server, as far as I can tell it doesn't have a dashboard. I'm also unclear if it has a JMX adapter out of the box, but I believe I read somewhere it did.
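For a sense of how little a Graphite producer has to do, the sketch below formats a datapoint in Carbon's plaintext protocol (`<metric.path> <value> <unix-seconds>` followed by a newline) and can write it to a socket. The class name `GraphiteLine`, the metric path, and the host are assumptions for illustration; 2003 is Carbon's default plaintext port, but a given install may differ.

```java
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from Graphite or jmxtrans.
public class GraphiteLine {
    // Builds one line of the Carbon plaintext protocol.
    static String format(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Opens a TCP connection, writes a single line, and closes it again.
    // Host and port are assumptions; a real agent would reuse the connection.
    static void send(String host, int port, String line) throws Exception {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(line.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        // Prints the line instead of sending it, so this runs without a Carbon server.
        System.out.print(format("analytics.example.jvm.heap.used", 1.5e6, now));
    }
}
```

Calling `send("graphite.example.org", 2003, line)` would deliver the datapoint, assuming a Carbon listener at that (made-up) address.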

Agents

These libraries might not provide all the features/components of an OTS package, but hopefully also provide less cruft and cleaner interfaces/better ideas.

jmxtrans: JMX agent library that aims to be the Rosetta Stone between the JMX protocol and various aggregation backends -- claiming support for "Graphite, StatsD, Ganglia, cacti/rrdtool, text files, and stdout".
Logz.io's jmx2graphite: A minimal-configuration tool that polls JMX metrics and writes them to Graphite every X seconds. No installation required: run it with the "docker run" command and specify the basics (Jolokia URL, Graphite host, and interval).
collectd: a system statistics collection daemon, with an impressive list of plugins.
Jolokia (source): Looks promising, but on further inspection, it does not have a storage/aggregation component. Instead it provides polyglot agents for exposing JMX data over HTTP: a proxy for normal JMX (JSR-160), a JVM agent, an OSGi bundle, and even an in-page JS client.
Netflix's Servo: An application monitoring library. API looks great (using annotations!), awesome set of transforms/features, high quality code. Unfortunately, Netflix runs everything on EC2, so the library only supports CloudWatch out of the box.
Twitter's Ostrich: A stats collector & reporter for Scala servers. For when we inevitably start writing Scala code. It does appear to provide an aggregator/admin server, but I didn't spend much time looking at it.
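As a rough illustration of what a polling agent does on each tick, the sketch below walks the in-process MBean server and collects every readable attribute whose value is a plain number. This is not how jmxtrans or any other tool listed above is implemented; the class name `NumericAttributeScan` and the `<ObjectName>/<attribute>` key format are invented for the example. Composite attributes such as `HeapMemoryUsage` are skipped because they are not top-level numbers.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical example class, not part of any tool listed on this page.
public class NumericAttributeScan {
    // Returns readable top-level numeric attributes of all MBeans matching the
    // pattern, keyed by "<ObjectName>/<attribute>".
    static Map<String, Number> scan(String pattern) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Number> found = new TreeMap<>();
        for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
            for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
                if (!info.isReadable()) {
                    continue;
                }
                try {
                    Object value = server.getAttribute(name, info.getName());
                    if (value instanceof Number) {
                        found.put(name + "/" + info.getName(), (Number) value);
                    }
                } catch (Exception e) {
                    // Some attributes are unsupported on a given JVM; skip them.
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        scan("java.lang:*").forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Discovering attributes this way avoids listing every MBean by hand, at the cost of collecting a lot of metrics nobody asked for; a real agent would want an include/exclude filter on top.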

Bad Ideas

Historical entries on this page that we've since sobered up about and ruled out.

Turmeric (source): I think this is more of a full-stack application platform (they call it a "policy-driven SOA platform") that incidentally provides monitoring of its services. Probably inappropriate for our needs, but it seems pretty interesting at least. See also: a related blogpost.

Thoughts

While Hastur sounds like it was a lot of fun to write, that's, uh, not a reason to run it? (What the hell was I thinking?) We should just go with Graphite or Ganglia. I personally prefer Graphite, as it has a way richer query/visualization API; digging into data is what this is all about. Ganglia is more mature, but in the world of monitoring it appears that "maturity" means "provides a user experience that increasingly approximates Windows 3.1". § dsc (talk) 19:47, 14 November 2012 (UTC)