http://www.charleshooper.net/blog/painless-instrumentation-o...
Be wary of measuring - and hence improving - the wrong thing.
Sometimes you optimize for the wrong metric. The classic example is measuring programmer output by lines of code. There are many more subtle ways this can manifest, though.
So, measure everything, including your measures.
Clickable link: https://www.hostedgraphite.com
(I'm one of the cofounders)
I got graphite going on an Ubuntu installed on my macbook while in the air between SFO and ATL.
Literally: discovered it, downloaded it, and was graphing metrics before we landed.
Now, that was, I think, in 2010. The last time I installed Graphite was a few months ago, and it seems like they have split the project up a bit. It took a bit longer.
NB: I'm neither a Python nor a Django 'guy'
For anyone interested, I wrote a node.js process that takes arbitrary statsd-compliant data points and serves a socket.io-enabled front-end for 'zero-config', realtime graphing.
We've found it useful internally for taking quick measurements on various projects. I was going to productize or open source the whole thing, but then life got in the way. Maybe it will see the light of day someday.
We also believe in measuring everything you can. We're interacting with many APIs across many boxes. Statsd + graphite are the tools we use to understand what's happening at runtime.
Graphite has a lot of warts, but it's really powerful once you get used to it. There are plenty of pretty interfaces you can put over graphite, but nothing really matches it for ease of ad-hoc queries.
Typically I'll use graphite to view ad-hoc metrics and build reports. When I find I'm repeatedly viewing a particular graphite report then I'll "hard-code" it in gdash [1] for the rest of the team.
We use this combo to track thousands of separate metrics and we've been pretty happy with it so far.
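An ad-hoc graphite query is really just a render URL. Here's a minimal sketch of one; the hostname `graphite.example.com` and the metric name are hypothetical, but `target`, `from`, and `format` are Graphite's actual render parameters:

```python
from urllib.parse import urlencode

# Build a render-API URL that sums a counter into hourly buckets
# over the last 24 hours and returns the datapoints as JSON.
params = urlencode({
    "target": "summarize(stats.counters.api.requests.count,'1hour','sum')",
    "from": "-24h",
    "format": "json",
})
url = "http://graphite.example.com/render?" + params
# urllib.request.urlopen(url) would fetch the datapoints from a live instance.
```

Swapping `format=json` for `format=png` (or dropping it) gives you the rendered graph image instead, which is what you'd paste into a dashboard like gdash.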
Implementation was easy. statsd is pretty simple to deploy and graphite wasn't too difficult either. To add statsd reporting to your code, it's essentially one line to create the statsd socket, another line of code to declare each timer or counter, and another to increment. I think more time was spent determining what name to give each metric than implementing it in this project.
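Those few lines might look like this in plain Python — a minimal sketch of the statsd UDP wire protocol, assuming a statsd daemon on its default address of 127.0.0.1:8125 (the metric names are made up):

```python
import socket

# One line to create the socket. statsd speaks UDP, so sends are
# fire-and-forget: if nothing is listening, your app doesn't care.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("127.0.0.1", 8125)  # assumed default statsd address

def incr(metric):
    """Increment a counter ("|c"); returns the payload for illustration."""
    payload = f"{metric}:1|c".encode()
    sock.sendto(payload, STATSD)
    return payload

def timing(metric, ms):
    """Report a timer value in milliseconds ("|ms")."""
    payload = f"{metric}:{ms}|ms".encode()
    sock.sendto(payload, STATSD)
    return payload

incr("signup.count")           # sends b"signup.count:1|c"
timing("signup.duration", 42)  # sends b"signup.duration:42|ms"
```

The naming effort mentioned above is real: the metric string is the only schema statsd has, so a consistent dotted hierarchy (e.g. `app.component.action.count`) is worth deciding up front.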
Now that I'm at dotCloud, I'm working with a much larger distributed system and we use it here also. We liked it enough to build some statsd hooks onto our RPC layer we use for just about everything. Now every time a component makes a remote procedure call, a counter for that call is incremented and the response time is sent to statsd. It's been very useful for troubleshooting odd behaviors and correlating events across the platform.
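A hook like that on an RPC layer can be sketched as a decorator. This is not dotCloud's implementation — just an illustration with hypothetical names, again assuming a statsd daemon on 127.0.0.1:8125:

```python
import functools
import socket
import time

SOCK = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("127.0.0.1", 8125)  # assumed statsd address

def instrumented(name):
    """Wrap a remote call: bump a per-call counter and report its
    response time to statsd, even when the call raises."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                ms = (time.monotonic() - start) * 1000
                SOCK.sendto(f"rpc.{name}.count:1|c".encode(), STATSD)
                SOCK.sendto(f"rpc.{name}.time:{ms:.1f}|ms".encode(), STATSD)
        return wrapper
    return decorator

@instrumented("get_user")
def get_user(uid):
    # Stand-in for a real remote procedure call.
    return {"id": uid}
```

Because the timing is sent in a `finally` block, failed calls still show up in the graphs, which is exactly what you want when correlating those 2AM failures.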
As people who work with complex distributed systems, we can't know exactly what they're doing. We'll think we know, and sometimes we'll be close. Other times we'll think we know, and then we'll wake up at 2AM because something failed horribly. By being able to monitor the system's behavior (sometimes in gross detail), we can get a little closer to knowing what's really going on.
If you're a Redis person, 37signals built a StatsD-compatible server using EventMachine that stores its data in Redis.