How this project came about
My first exposure to observability was back in 2011, while I was working at CoreLogic. I'd written a simple logging library for our services, and over time the library ended up being used to observe our production systems: we put "call identifiers" into every public API method and most private ones, so we could trace API calls from initial request to final response, from service to service, often drilling down into core libraries. We logged all kinds of stuff: incoming IP addresses, function names and parameters and return values, error messages and stack traces, memory usage, elapsed times, et cetera et cetera et cetera.
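To make that concrete, here's a minimal sketch of the idea (not the actual CoreLogic library; the function names and fields are made up): mint a call identifier at the edge of a request and stamp it on every log line, so that filtering on one id reassembles the whole call.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders")

def handle_request(payload):
    # A fresh identifier is minted at the edge and threaded through
    # every log line, so one grep on the id reconstructs the whole call.
    call_id = str(uuid.uuid4())
    logger.info("call_id=%s fn=handle_request payload=%r", call_id, payload)
    result = lookup_order(call_id, payload["order_id"])
    logger.info("call_id=%s fn=handle_request result=%r", call_id, result)
    return result

def lookup_order(call_id, order_id):
    # Deeper calls keep stamping the same identifier on their own lines.
    logger.info("call_id=%s fn=lookup_order order_id=%s", call_id, order_id)
    return {"order_id": order_id, "status": "shipped"}

handle_request({"order_id": 42})
```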
We did this because it was the only way we could understand what our services were doing in production: we discovered that they often behaved very differently there than in our development and test systems. Only production traffic could reveal that our indexes were optimized for only certain kinds of queries, or that some queries hit the disk 800 times more often than others, or that a single minute-long query could make other queries wait 20 minutes because the server had run out of threads to allocate.
Around the same time I came across Steve Yegge's old blog, and because he was super smart and super entertaining, I basically tore through every post in something like three days, and kept coming back to remind myself of the things he wrote that I'd realized were brilliant. One of those remarks--I can't find the quote now, so you'll have to trust me--was that software monitoring caught more bugs than unit testing. I didn't know exactly what "software monitoring" was, but I had a strong feeling that what we were doing in our own services was basically that, and felt validated.
Still, there were some big problems. The biggest was that finding the source of a problem often required hours, and sometimes days, of sifting through giant piles of log messages with complex, multiply-piped grep commands. We could see our system's behavior in real time, but watching it behave often felt like watching the squiggly characters scroll up screens in The Matrix. The danger was clear: we were almost certainly missing problems because they were buried in so much noise that they slipped past unnoticed.
So by 2015 I knew that something had to change in how I was thinking about monitoring software systems. And it was around then that I came across this odd person on Twitter named Charity Majors. She used to work at Facebook, had started a company with the funny-sounding name "Honeycomb", and couldn't stop ranting about this thing she called "observability". Oh, and she liked saying what struck me at the time as utterly heretical things: stuff like "TEST IN PRODUCTION OR LIVE A LIE" and "NO-DEPLOY FRIDAYS ARE FOR COWARDS". (I'm paraphrasing, but she'd back me up.) Also she had weird rainbow hair and tattoos, and cursed like a real human, and seemed Just So Done with all the bullshit I'd spent so much of my career believing was part of the job. TL;DR I was ready to listen to whatever she had to say.
And what she had to say was that while I'd been building observable systems for half my career, I could be doing it a whole lot better. For one thing, logging was a very different thing from observing. What I was trying to do with my primitive logger spitting out messages all over the place was capture events: I was building a mental model of a system's state by sifting through descriptions of snapshots of that state. What if I could use software to build the model itself? And if I didn't want to drown my logs in noise, I could sample events instead of logging every single one.
These were great ideas!
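Here's roughly what that shift looks like in code. This is just my own illustration, not Honeycomb's API: one structured event describes a whole unit of work, and a sample rate (an arbitrary 1-in-10 here) decides whether the event gets sent at all, with the rate recorded on the event so counts can be re-weighted later.

```python
import json
import random
import time

SAMPLE_RATE = 10  # keep roughly 1 in 10 events; arbitrary for illustration

def send_event(fields):
    """Emit one structured event describing a whole unit of work."""
    if random.randrange(SAMPLE_RATE) != 0:
        return  # drop most events instead of logging every single one
    fields["sample_rate"] = SAMPLE_RATE  # so downstream counts can be re-weighted
    print(json.dumps(fields))

def handle_request(order_id):
    start = time.time()
    status = "shipped"  # pretend real work happened here
    send_event({
        "name": "handle_request",
        "order_id": order_id,
        "status": status,
        "duration_ms": round((time.time() - start) * 1000, 3),
    })

for i in range(100):
    handle_request(i)
```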
So how was I to go about implementing them? Well, I wanted our team to adopt Honeycomb for observability and re-instrument our services, but there was no way I would get approval for that. Luckily, a couple of years later I got laid off from CoreLogic in yet another one of their "restructurings", so I was free to take my observability preoccupation to another company. After trying (and failing) to get it adopted at my new job, and while searching for my next one, I decided to figure out how hard it would be to build the simplest possible observability infrastructure, so I wouldn't have to persuade anyone to risk going to a third-party service. It would be my own stupid little open-source library that could be used as a proof of concept, and it would be easy to replace with something real if the company decided to invest.
Also, building my own observability infrastructure would teach me new things. Reading Honeycomb's documentation on its tracing and events libraries and scrutinizing its source code revealed a few super-cool features I knew very little about, such as using Python decorators to streamline aspect-oriented programming, and placing process-level variables into initialization code to make it easy to pass state around invisibly. This, too, was great stuff, and I was eager to start learning it.
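For a flavor of what I mean, here's a small sketch in the same spirit (my own toy code, not Honeycomb's beeline): a decorator wraps any function and records its timing and errors without touching its body, and a module-level variable set once in an init() call holds state that every decorated function can reach without being passed around explicitly.

```python
import functools
import time

# Process-level state, set once during initialization and then reachable
# from anywhere in the process without being passed explicitly.
_client = None

def init(service_name):
    global _client
    _client = {"service": service_name, "events": []}

def traced(fn):
    """Decorator: record timing and errors around fn without touching its body."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            _client["events"].append({
                "service": _client["service"],
                "name": fn.__name__,
                "duration_ms": round((time.time() - start) * 1000, 3),
                "error": error,
            })
    return wrapper

@traced
def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

init("orders")
lookup_order(42)
print(_client["events"])
```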