Thoughts on Oddly Specific Versioning in Observability

Or, what matters most in 2025

Dec 30, 2024

I think the Observability 2.0 and Observability 3.0 discourse is for Architecture Astronauts. While they offer directionally correct ways of thinking long term, I don't believe it's a practical concern for most folks in 2025. Cart before the horse, as they say.

To be clear, I don't think anyone is wrong, really. More like, oddly specific? I get the gist. To Observability 2.0's point, yup, there are preferred end-to-end implementations which make it easy (trivial, even) to ask the sort of Apache DataFusion/Arrow/Parquet type questions of your systems’ Observability Signals, providing answers which help you understand why things are misbehaving, rather than merely notifying you when they are misbehaving. And I think, to Observability 3.0's point, using these Observability Platforms to actually act effectively, for a wide variety of Personas (including non-engineering roles), is a good north star. Discussing the right way to do all that, the ideal way to interact end-to-end with that "Single Braid of Observability Data", as Ted Young once called it, those famous canonical-structured-trace-log-event-spans-with_baggagey_stuff, or what have you, is a useful exercise. But, again, I'm just not convinced that's the pressing issue for most people, at least in 2025.

Who are “most people”? I don't know. What am I, some sort of thought-leader? But anecdotally, as someone with a modest amount of experience, I think “most people” are like me, and just want table stakes monitoring tools, everywhere, easily, affordably. To repurpose Mr. Young's analogy: most folks want *tight* braids of data. How to layer the Observability 2.0, 3.0, Observability: The College Years functionality, features, and mindset on top is a secondary concern.

Is that Observability 1.0 still? Ok, fine. You don't necessarily need to do everything everywhere all at once. You just want a set of common things, easily. Notify you when services are not reliable enough, let you look at visualizations of those services, let you query services by common dimensions and attributes, the majority of which are not excessively high cardinality. Help you collect the data you need easily without breaking your software in the process, help you store it somewhere you can retrieve it quickly, help you define your queries with GUIs but track and bulk manage them with IaC, let you export your data in a format that other software tools can work with, and so on. Oh, and you have a lot of OKRs you're promising, so you don't have a ton of innovation tokens to spend on all this observability and monitoring stuff, so please don’t make you customize all your application code to fit a specific platform. Ah, and, last but not least, make sure that's all affordable enough that it fits in your budget, which is suddenly getting more eyeballs on it, C.R.E.A.M. and all that.

That's not to say the sort of infinitely flexible querying, in ways so simple even a mere mortal Technical Program Manager or Business Analyst could use them, isn't also useful. I just think it's day 2.

Is that so controversial? Heinrich Hartmann evangelizes this concept of an SRE Triangle, which is similar-ish to how people talk about the CAP Theorem for databases. Here, though, the three corners are Reliability, On-Call Health, and Productivity.

They tend to trade off against each other. The faster you ship, the less reliable your software tends to be, and if you do manage to ship really fast and need to provide really high levels of reliability, you end up with on-call that is a nightmare for the SREs and folks operating the software. The table stakes stuff I just described helps your organization position itself on the triangle, via SLOs. SLOs are constructed with pre-baked metrics, generally 7- or 30-day rolling windows on Prometheus recording rules, last I checked. The deeper insights and effectiveness of Observability 2.0 and 3.0 aren't really a prerequisite here, as helpful as they may be.
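For what it's worth, here is roughly how little math an SLO needs. This is a minimal sketch, not anyone's official implementation: it assumes you already have good and total request counts summed over a rolling window (say, 30 days, the kind of thing a recording rule would pre-bake for you). The names `WindowCounts`, `availability`, and `error_budget_remaining` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts summed over one rolling window (hypothetical shape)."""
    total: int  # all requests seen in the window
    good: int   # requests that met the reliability criteria


def availability(counts: WindowCounts) -> float:
    """Fraction of requests in the window that were good."""
    if counts.total == 0:
        return 1.0  # no traffic, nothing failed
    return counts.good / counts.total


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if counts.total == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * counts.total
    actual_bad = counts.total - counts.good
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Example: 1,000,000 requests in the window, 500 of them bad, 99.9% target.
window = WindowCounts(total=1_000_000, good=999_500)
remaining = error_budget_remaining(window, slo_target=0.999)
```

With those assumed numbers, the target allows about 1,000 bad requests and 500 have happened, so roughly half the budget is left. Nothing here needs high-cardinality, arbitrarily-wide events. Two counters and a window will do.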

"But Eric!", you interject, a wry smile forming, such smart are you, so wise in their ways. "You're simply explaining an Observability Maturity Model! You Fool, You complete Bozo!" Yea, I guess so? I think I'm saying, the pressing concern is not getting up-and-to-the-right at all costs in 2025, for most people, despite what the cool kids are saying online. Maybe you, dear reader, are simply built different. That's ok. But I think you'd be the minority here, not me.

To wit, the latest craze these days seems to be the concept of "Pipelines", like Telemetry Pipelines. They all advertise this growing concern over reducing Observability Spend. "You're being had!", they shout, "The Bus is Out of Control!"

It all feels like a bit of theater, no? Is the most pressing concern getting to Observability 2.0? Or is the most pressing concern delivering a cost-effective monitoring solution? Does following one of those Observability up-and-running guides mean I've potentially foot-gunned myself, and need to scramble to shove a pipeline into place to manage the bleeding? Why am I being sold a product at all if it comes with a big warning label that says "Probably too expensive for a simple caveman lawyer like you"? This feels like being sold a very shitty wrapper over BigQuery.

A good Observability Platform does more than let you write really gnarly queries of data in real time storage systems. A good Observability Platform will effectively curate that firehose of raw observational data (aka OpenTelemetry) for you, establishing reasonable tradeoffs and limits for how a User can interact with that data. A Golden Path, as they say. "Do things this way and you will have a good time on My Very Special Venture/PE/Cisco Owned Observability Platform Multivac Fun Zone". Sometimes you can buy that wholesale from a Vendor. Other times Vendors will sell you multiple separate point solutions (*SKUs*) and you'll have to wire them together yourself. Other times, you'll have to operate the whole thing on your own, because Compliance, or NIH, or whatever. But the work must get done, somewhere, by someone.

That's the most pressing concern, in my opinion, for Observability in 2025. "Who’s gonna carry the boats!?", as David Goggins (*gags involuntarily*) says. Who will do the work? Who will make sure you have what you need on day 1 to observe and investigate your systems, easily, at a sustainable cost? Does your vendor think it's you? Do you think it's your vendor? You'd be surprised.

Vendor Solutions Architect meets Customer with Brownfield deployment

If you're telling users the most important thing they need to do next is upgrade to Observability 2.0, they'll believe you. Maybe they shouldn't.

Thanks for reading,

Eric Mustin

A Small, Good Thing
