"It has been said critically that there is a tendency in many armies to spend the peace time studying how to fight the last war" - Lieut. Col. J. L. Schley, 1929
Background
I've been involved with OpenTelemetry off and on for about the past eight months, partially because my employer is a vendor but mostly because it's an interesting Open Source project that I'd like to see gain adoption. And so this is my attempt to organize my thoughts on the project so far, and hopefully also act as a resource for anyone interested in the subject. These are my personal views.
Why is OpenTelemetry Popular?
Listen: There are lots of folks who care about OpenTelemetry. It's hard to believe, but OpenTelemetry had a ton of eyeballs in 2020 for an OSS project that's not GA, not widely used in production, and frankly, kind of boring. It's kind of just Structured Logs, after all, not cool AI Automation. And yet, stubbornly, OpenTelemetry was popular in 2020, becoming the 2nd Most active CNCF Project (behind only Kubernetes). Why was that?
Well, maybe because it has lofty goals to unify Tracing, Metrics, and Logs Telemetry standards, which impacts just about every software developer. And, it does have official SDKs in a lot of languages. But probably the real reason for its "popularity" is that a bunch of Vendors and Cloud Providers have all "bet" on OpenTelemetry, to varying degrees, having decided it's better to have one generally accepted telemetry standard than to maintain a bunch of bespoke ones. And they're not shy about it, with Dynatrace highlighting it in their last earnings call, and AWS giving it love at re:Invent and beginning to bake it into their existing monitoring stack, to name just a few examples. And users have taken note.
With OTEL's Tracing Spec at 1.0, many SDKs at 1.0 or an RC, and focus now turning to Metrics and Logging stability, the discourse around OpenTelemetry has been almost uniformly positive. Look, shiny auto instrumentation! Look, Open Source! Look, No Vendor Lock In! And, that's all great and valid. But I think it's ok, and healthy, to be critical (constructively) too. So I want to ask not just "What does OpenTelemetry do well?", but also understand "Where can OpenTelemetry be better?".
What OpenTelemetry Does Well
Like any good performance review, before criticizing you should praise. So, what does OpenTelemetry do well right now? I think OpenTelemetry does A LOT really well, but the way I'd sum it up is to say that OpenTelemetry currently fights the last war very well. "The last...wut?" you say? Well, in the military context, it means using tactics that would've worked great before but might not be on the cutting edge today or tomorrow. So, building a Castle despite the invention of Cannons, for example. Or, The Charge of the Light Brigade, for the poetry inclined.
In the context of software, I mean that OpenTelemetry, as it exists today, does a great job solving yesterday’s problems. And what were those problems? Well, you had an application where you wanted to understand where requests were getting bottlenecked or causing errors, so you needed tracing of application requests. And apps and systems that were more than just a monolith and you wanted to see how they were connected, so you needed trace context propagation across boundaries. And you had polyglot systems, so you needed Automatic Instrumentation of popular libraries in every language. And you needed to sample and aggregate and filter the data you collected, so you needed some sort of collection daemon or processing pipeline to apply to your data. And so on and so forth.
These were all the things that Jaeger, and Zipkin, and OpenTracing/OpenCensus, and then the vendor implementations and extensions, attempted to address. They had their tradeoffs and their specs and some things worked well and others not so well, and all the approaches were generally non-interoperable or conflicting and it was a huge pain to migrate from one solution to another or even compare them.
So OpenTelemetry, at Tracing 1.0, has solved basically all the above, or has a clear roadmap to solving them. You get well thought out standards, conventions, and specifications. You get distributed tracing that passes along context between apps, and has compatibility with every other kind of distributed tracing out there. You get automatic instrumentation in a bunch of languages that's got consistent semantic conventions, structure, and metadata. You've got an "Agent/Collector" that does all the heavy lifting of aggregation and filtering and queueing and retrying and handling API Keys and so on. There's modularity and config options that let you deploy in different environments or with customizations, and rely on consistent environment variable naming. And, perhaps most importantly, you have an inclusive approach to existing standards, so if you've already got some Zipkin instrumented applications and some vendor instrumented applications and so on, it's mostly all portable to OpenTelemetry and can get translated by the Collector, and exported to any backend, even multiple backends at the same time with the same data. It's as simple as changing a YAML file.
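To make "passes along context between apps" a bit more concrete, here's a rough sketch of the idea using only the Python standard library. It assumes the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags). The helper names (`inject`, `extract`, `start_child_span`) are made up for illustration and are not the OpenTelemetry SDK's API, and it skips parts of the real spec such as rejecting all-zero IDs.

```python
import re
import secrets

# W3C Trace Context "traceparent" header: version-traceid-spanid-flags,
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side (hypothetical helper): write the current context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side (hypothetical helper): read the upstream context, or None if absent/invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # The lowest bit of the flags byte is the "sampled" flag.
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": int(flags, 16) & 1 == 1,
    }


def start_child_span(headers):
    """Continue the caller's trace if one was propagated, otherwise start a new one."""
    ctx = extract(headers)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    if ctx is None:
        return {"trace_id": secrets.token_hex(16), "span_id": span_id, "parent_span_id": None}
    return {"trace_id": ctx["trace_id"], "span_id": span_id, "parent_span_id": ctx["parent_span_id"]}


# Service A makes a request; service B picks up the same trace ID.
outgoing = inject({}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
child = start_child_span(outgoing)
```

The point is just that both services end up sharing one trace ID, with the callee's span pointing back at the caller's span as its parent. In practice the SDK's propagators and auto instrumentation do this for you.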
This is all awesome work and is nothing to scoff at. Now you have something completely Open Source that there are folks still getting Billion Dollar Valuations for just getting around to doing. And it’s not just a benefit to end users, but it offers immediate ROI to many existing vendors too, who had previously been relying on poorly maintained community tracing libraries in certain languages, or had been forced to fork competitor's libraries. But it leads into the second question worth asking here, and that is: Where can OpenTelemetry be better? If everything I rambled about above was "the last war" then, what's the current one, what's the next one? What's the thing that investing dev cycles in today will prove to be a worthwhile investment and helpful for teams for the next few years?
Where OpenTelemetry Can Be Better
And I think the answer is...Observability. It's a term, if not invented, certainly popularized by Charity Majors, and then quickly adopted (Co-Opted? Kidnapped? Hijacked?) by the broader Monitoring community, but it does have meaning. In practice it means being able to make inferences about your system from its outputs, or basically, being able to ask arbitrary questions about your system. So this means being able to take these different Telemetry sources (Metrics, Traces, and Logs) and investigate them in the context of one cohesive event. Like being able to query a metric derived from your trace (latency, for example) sorted by a metadata field you'd collected but didn't know would be especially important ahead of time (a deployment version or a user's payment type, for example). It means being able to correlate your shiny new web framework's traces with dusty old Nginx logs and your plain jane statsd metrics. And right now, OpenTelemetry could be a lot better at Observability.
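Here's a toy sketch of what "query a metric derived from your trace, sorted by a metadata field" could look like, in plain Python. The span records, attribute names (`deployment.version`, `payment.type`), and the `latency_by` helper are all hypothetical, and a real backend would do this over far more data than a list in memory. The shape of the question is the same, though.

```python
from collections import defaultdict
from statistics import median

# Hypothetical finished spans; the attribute names are illustrative only.
SPANS = [
    {"name": "POST /checkout", "duration_ms": 120,
     "attributes": {"deployment.version": "v41", "payment.type": "card"}},
    {"name": "POST /checkout", "duration_ms": 135,
     "attributes": {"deployment.version": "v41", "payment.type": "paypal"}},
    {"name": "POST /checkout", "duration_ms": 110,
     "attributes": {"deployment.version": "v42", "payment.type": "card"}},
    {"name": "POST /checkout", "duration_ms": 900,
     "attributes": {"deployment.version": "v42", "payment.type": "paypal"}},
    {"name": "POST /checkout", "duration_ms": 870,
     "attributes": {"deployment.version": "v42", "payment.type": "paypal"}},
]


def latency_by(spans, attribute):
    """Group span durations by any attribute, slowest median first."""
    groups = defaultdict(list)
    for span in spans:
        key = span["attributes"].get(attribute, "<missing>")
        groups[key].append(span["duration_ms"])
    summary = {
        key: {"count": len(values), "p50_ms": median(values), "max_ms": max(values)}
        for key, values in groups.items()
    }
    # Sort so the worst-performing group is listed first.
    return dict(sorted(summary.items(), key=lambda item: item[1]["p50_ms"], reverse=True))


by_payment = latency_by(SPANS, "payment.type")
by_version = latency_by(SPANS, "deployment.version")
```

Nothing about `latency_by` knows ahead of time that payment type matters. The attribute is just a string you pass in at query time, which is the "arbitrary questions" part.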
For starters, the work of storing and querying your telemetry data, which is essential for Observability Tooling, has so far been punted on as part of OpenTelemetry, with the common answer here being, essentially, "Export to a vendor!" and hoping they do a good job at a fair price. And while a number of vendors do a great job, this somewhat defeats the rallying cry of "No Vendor Lock In", because well, you're still locked in to needing to use a vendor, you just have more flexibility around which one. As an OSS Project it would be nice to know that there's a true OSS option as well. At the moment, some really promising Open Source tools are starting to emerge for trace storage+querying, like Tempo. But, despite Grafana being a dyed-in-the-wool OSS company, Tempo is distinct from OpenTelemetry, and frankly, still quite new. As far as I know, it only allows query by trace ID. This means that you're essentially accessing your traces only by jumping to them from either:
Logs, which would include the Trace Context injected into them if they're correctly structured. Or,
Metric Exemplars, which are metric aggregations that include sample data points, like trace IDs.
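As a rough illustration of those two entry points, here's a plain-Python sketch. Everything in it is hypothetical: the log line shape, the histogram bucket layout, the in-memory `TRACE_STORE`, and the helper names are stand-ins for illustration, not Tempo's or OpenTelemetry's actual data formats or APIs.

```python
import json

# Hypothetical structured logs with trace context injected at write time.
LOG_LINES = [
    '{"level": "info", "msg": "order created", "trace_id": "0af7651916cd43dd8448eb211c80319c"}',
    '{"level": "error", "msg": "payment declined", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    "plain text line with no trace context",
]

# Hypothetical latency histogram where each bucket keeps one exemplar.
HISTOGRAM = [
    {"le_ms": 100, "count": 940,
     "exemplar": {"value_ms": 87, "trace_id": "0af7651916cd43dd8448eb211c80319c"}},
    {"le_ms": 1000, "count": 58,
     "exemplar": {"value_ms": 912, "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}},
]

# Stand-in for a trace store that can only be queried by trace ID.
TRACE_STORE = {
    "0af7651916cd43dd8448eb211c80319c": ["POST /orders", "INSERT orders"],
    "4bf92f3577b34da6a3ce929d0e0e4736": ["POST /checkout", "charge card"],
}


def trace_ids_from_logs(lines, level):
    """Entry point 1: find trace IDs in structured logs at a given level."""
    ids = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured lines carry no trace context to jump from
        if isinstance(record, dict) and record.get("level") == level and "trace_id" in record:
            ids.append(record["trace_id"])
    return ids


def trace_id_from_slowest_bucket(histogram):
    """Entry point 2: take the exemplar attached to the slowest bucket."""
    with_exemplars = [bucket for bucket in histogram if bucket.get("exemplar")]
    if not with_exemplars:
        return None
    slowest = max(with_exemplars, key=lambda bucket: bucket["le_ms"])
    return slowest["exemplar"]["trace_id"]


def get_trace(store, trace_id):
    """The only query the store supports in this sketch: lookup by ID."""
    return store.get(trace_id)


from_logs = [get_trace(TRACE_STORE, tid) for tid in trace_ids_from_logs(LOG_LINES, "error")]
from_exemplar = get_trace(TRACE_STORE, trace_id_from_slowest_bucket(HISTOGRAM))
```

Notice that the plain text log line is a dead end. If the trace ID never made it into the log or the exemplar, there's no way to get from that signal to the trace.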
Additionally, even though the OpenTelemetry Tracing Specification is 1.0, Logging (including trace-id injection into logs) and Metrics (including Exemplar support) are still either largely experimental and subject to change or completely unimplemented in some languages. Without these key details, the correlation of your Telemetry Data into one cohesive event with just OpenTelemetry tooling is not particularly well supported across all languages. By extension, Observability still relies largely on your vendor to bring you across the finish line by deriving+aggregating metrics associated with spans and their metadata, correlating logs with trace-context at log write time, and giving you the flexibility to query your spans directly by their metadata. It would be great to see OpenTelemetry give users true flexibility to not just say "I'll switch from vendor A to vendor B", but even say "I can achieve Observability on my own".
Are there other things I'd like to see OpenTelemetry tackle? Sure. I think it would be nice to see more library authors bake tracing in natively using OpenTelemetry APIs, like Spring Cloud Sleuth has started to do. I think there are unsolved problems around how to track and represent Queue-based systems or Background Jobs intuitively in distributed tracing, especially in Serverless environments which depend heavily on tools like SNS/SQS/Kinesis. I think the footprint and performance tradeoffs of some OpenTelemetry libraries still leave much to be desired. I think the constantly shifting signalling of what is "Stable" vs "Experimental" vs "GA" is too fluid and might end up snatching defeat from the jaws of victory when it comes to OpenTelemetry adoption. And so on. But largely these are all more minor issues. What really matters, I think, is making sure that the Roadmap stays focused on Observability. If so, everything else will work itself out.
So that's where I'd love to see OpenTelemetry improve in 2021, by not just fighting the last war, but by looking ahead to give users ways to solve tomorrow's challenges. To be clear, so far it seems like, yes, we'll get there. The roadmap is there, certainly. But it is going to take work, and it will be interesting to see whether, now that some vendors have gotten all they want out of the project (subsidized the need to maintain a bunch of Tracing SDKs in a bunch of languages, and made it easier for users to trial their SaaS backend vs a competitor's), they continue to commit time and resources to ensure that Metrics and Logging reach the same maturity that the Tracing Specification has. I guess we'll see.
Thanks for reading,
Eric Mustin (@ericmustin)