Last month I reached out to some folks in my network to learn more about how they were leveraging Service Level Objectives (SLOs) in their organizations. I’ve been encountering them more and more in the wild as the end-all-be-all of “What Good Looks Like” for an engineering organization that cares about reliability. Why? Well, at a high level, it seems like folks collectively spending $12 billion on Monitoring and Observability want to use SLOs to answer their most pressing question: are my services reliable? But why SLOs specifically? Probably because that’s how Google does it. As described in the Google SRE Handbook, Service Level Objectives (SLOs) are the highest-fidelity signal of a service’s health, purpose-built to answer exactly that question of whether a service is meeting its reliability objectives or not. And yet, both in my personal experience and in what people told me anecdotally, it seemed like no one actually found them to be terribly useful. Is this really the case? Are SLOs merely cargo-cult reliability theater, or are they actually an effective tool in the hands of the right teams?
To this end, I ended up chatting with, and getting responses from, about a dozen folks over the course of a month or so, almost all senior-ish folks within engineering, on both the IC and Management tracks. While they weren’t exclusively in dedicated SRE roles, it was definitely a bit of a self-selecting group. What did vary was the breadth and range of employers, with folks chiming in from startups, Nasdaq-listed tech companies, Fortune 500 enterprises, and SMB/Middle-Market companies. The SRE Cinematic Universe, so to speak. I thought it would make sense to share back the learnings, which I’ll keep anonymous and brief.
“Adopting SLOs” is project-based work
Despite the variability in company size and type, most folks described their efforts around SLOs as chunky, top-down initiatives. There is usually some ambitious goal, OKR, or amount of work that gets prescribed (e.g. “by end of year, <some percentage> of Tier 1 services have an SLO published in <some tool>”). Many sprints are dedicated to it, either by a specific team or fanned out as smaller chunks to individual teams. There’s an ex-FAANG executive who mandates a Hoover-esque chicken in every pot: an SLO for every service. There’s a tool for generating SLOs, or doing some static analysis to suggest SLOs, or generating statistics and summaries for executive reporting, or cleaning up useless alerts, and so on. It was very rare to hear stories of SLOs happening bottom-up in organizations, the way you sometimes hear about other Monitoring/Observability/Reliability-adjacent tools gaining adoption.
Managing SLOs is an ongoing process that reflects organizational maturity
Unlike adoption, the maintenance, iteration, and use of those SLOs in decision making was almost always a by-product of existing engineering culture and maturity. Among the most mature technology organizations, there were success stories of SLO adoption and alert fatigue reduction, as well as some sense of ongoing or recurring work around their management and maintenance. Alerts were regularly reviewed, SLOs were incorporated into postmortems, SRE engaged with engineering teams in a consultative way to help them select appropriate SLOs/SLIs, and occasionally product-level decision making was even guided by SLOs. But these efforts were the by-product of a mature engineering culture, with existing effective communication practices, where SRE already had a “seat at the table” among executives, rather than SLOs being a tool that creates that culture. And even then, it usually took significant effort from SRE to get to a baseline level of effectiveness, and the ROI wasn’t obvious.
Among relatively less mature technology organizations, and also those for which shipping too slowly represented an existential risk (i.e. startups), SLOs were generally ineffective. Product frequently put them on the back burner or ignored them completely, and they decayed significantly in usefulness, becoming relegated to spamming some Slack channel or disregarded entirely.
SLO Theory and SLO Practice are wildly different
Many respondents felt that SLOs, as advertised “on the tin” by the Google SRE Handbook et al., were significantly different from how they were used in practice in their organizations. In some cases this was as simple as the calculation methodology itself being inaccurate or useless (e.g. teams using a 1-day rolling window for SLO burn-rate alerting, creating massive alert fatigue and resulting in teams simply turning the alerts off). In other cases it was the lack of buy-in from Product Owners on when to address reliability vs feature development (this was the most common criticism: SLOs were adopted and nobody cared). In others, SLOs were not actually leveraged internally to communicate expected levels of reliability of different services. And in yet others it was the lack of table-stakes effort or buy-in from the engineering teams to really participate in the process of defining SLOs, picking appropriate SLIs, setting realistic error budgets, or tuning burn-rate alerting methodologies.
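To make the burn-rate complaint concrete, here’s a minimal sketch of the math involved, assuming a request-based availability SLI over a 30-day budget period and the commonly cited multi-window thresholds (14.4x over paired long and short windows). The names and numbers are illustrative, not taken from any respondent’s setup.

```python
# Burn rate: how fast the error budget is being consumed relative to plan.
# Assumes a request-based availability SLI with a 30-day budget period.

SLO_TARGET = 0.999                # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET     # 0.1% of requests may fail over the period

def burn_rate(good: int, total: int) -> float:
    """1.0 means 'on pace to spend exactly the whole budget'; 14.4 sustained
    for an hour means a 30-day budget would be gone in roughly two days."""
    if total == 0:
        return 0.0
    error_rate = 1 - good / total
    return error_rate / ERROR_BUDGET

def should_page(good_1h: int, total_1h: int,
                good_5m: int, total_5m: int) -> bool:
    # Require both a long and a short window to be burning fast. A single
    # 1-day window, by contrast, pages late and then keeps paging long after
    # recovery, which is exactly the alert-fatigue failure mode above.
    return (burn_rate(good_1h, total_1h) >= 14.4 and
            burn_rate(good_5m, total_5m) >= 14.4)
```

For example, 990 good requests out of 1,000 against a 99.9% objective is a 1% error rate, i.e. a burn rate of 10, which would exhaust a 30-day budget in roughly three days if sustained.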
It was common to hear about SLOs devolving into what one respondent termed just another “fancy alert” that ends up filling up a Slack channel and little else. One respondent offered a good rule of thumb for SREs attempting to evangelize SLOs: treat engineers as non-technical users. The cognitive load of SLOs and all their gnarly bits, while engineering-related, is not really something the average product-focused engineer wants to (or even really ought to) concern themselves with. This really resonated with me after years of trying (and failing) to get the average engineer to want to write PromQL.
The exception here was organizations which provided specific SLAs around SaaS-type products. In these cases, there were some success stories of teams “buying in” to SLOs, in the sense that multiple stakeholders were involved in their formation and reporting, there was care and consideration put into the alerting strategies around them, and Product, Engineering, and SRE all demonstrated a vested interest in setting realistic Objectives, measuring accurate SLIs, incorporating them into decision making at the product level, and generally making them a part of workflows. But even this was usually specific to teams with products that were directly customer facing, i.e. at the start of the call chain. For services deeper in the call chain, whether or not they related to an SLA, the adoption and effectiveness of SLOs decreased significantly.
SLI selection and error budget management are non-obvious
A number of respondents voiced issues with who chooses the SLIs and what data is available to choose from. This was usually the case for teams with services far down the call chain from the user experience, like an ETL pipeline or a database. It wasn’t obvious which signals to use as SLIs that tie back to measuring the user’s experience, and it wasn’t clear how to manage SLOs with a number of dependencies which themselves do not have SLOs defined. In other cases, it was pointed out that patterns like the Saga pattern don’t really translate well to a clean, out-of-the-box SLI that fits neatly into the SLO tool they were using. In others, respondents felt that the SLI was too simplistic: a blackbox ping on a URL might be used as the SLI for a web service, but that often oversimplified the actual mechanics of that service and whether the user could interact with it correctly.
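As a hypothetical illustration of that gap, compare a blackbox-ping SLI with a freshness SLI for a pipeline-style service. The names and the one-hour threshold are made up for the example, not taken from anyone’s system.

```python
from datetime import datetime, timedelta, timezone

def ping_sli(http_status: int) -> bool:
    """Blackbox view: 'the endpoint returned 200', which says nothing about
    whether the data behind it is correct or current."""
    return http_status == 200

def freshness_sli(last_successful_run: datetime,
                  max_staleness: timedelta = timedelta(hours=1)) -> bool:
    """Closer to the user's experience of an ETL pipeline: 'the data a user
    sees is no more than an hour old'."""
    return datetime.now(timezone.utc) - last_successful_run <= max_staleness
```

The second is a better proxy for the user, but it only works if the owning team actually records that timestamp and picks the threshold, which is exactly the kind of participation respondents said was missing.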
Additionally, it was common to hear that most stakeholders didn’t really understand how to think about the error budget holistically. The obvious case has already been mentioned: Product, when push comes to shove, often simply does not care whether a team needs to invest more to defend its SLO. The features simply must be shipped asap, SLO be damned. But on the flip side, some respondents also pointed out that even among engineering teams with stellar track records of staying well within their error budgets, there was rarely any willingness from engineering to actually burn error budget, take risks, or set a higher Objective. They would rather just pad stats, so to speak.
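For what it’s worth, the “spend the budget” framing is just arithmetic. A rough, purely illustrative sketch, assuming a time-based availability SLO over a 30-day window (my numbers, not any respondent’s):

```python
# Back-of-the-envelope error budget arithmetic for a time-based availability
# SLO over a 30-day window. Targets are purely illustrative.

PERIOD_MINUTES = 30 * 24 * 60          # 43,200 minutes in 30 days

def budget_minutes(slo_target: float) -> float:
    """Total allowable downtime for the period."""
    return (1 - slo_target) * PERIOD_MINUTES

print(budget_minutes(0.999))    # 43.2 minutes a month available to spend
print(budget_minutes(0.9999))   # 4.32 minutes a month, ten times less slack
```

An untouched 43 minutes at the end of the month is, in theory, capacity for riskier releases, experiments, or maintenance that went unused; in practice, respondents said teams preferred to bank it.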
Overall, the majority of respondents felt that investing in SLOs was a low-ROI effort for an SRE team, especially without explicit buy-in from other stakeholders in the process. SLOs require both an initial upfront investment and ongoing, recurring work, plus inclusion in the organization’s engineering cultural practices.
When those circumstances exist, and teams can offer collaborative tools to help their engineers generate SLOs, measure them, communicate them, iterate on them, and incorporate them into decision-making processes, there are teams that have found success with SLOs for improving reliability, reducing alert fatigue, and helping drive product decisions. But there was no magic bullet even in these situations, as it sometimes took multiple iterations to find a happy medium, and SRE orgs often had to provide bespoke tooling, reporting, and communication practices around SLOs that were not simply out-of-the-box offerings from their SLO “vendor”.
These situations were the exception to the rule. Practically speaking, it was rare for SRE teams to have enough political capital within an organization to deliver both that initial investment and the long-term cultural changes necessary to incorporate SLOs into engineering and product decision making. This was especially true when the goal was largely to measure user experience holistically, rather than to adhere to specific contractual SLAs.
Thanks to all the folks who took the time to chat with me on SLOs, SRE, and how their organizations think about reliability. Hopefully folks reading this can find some value in these shared learnings, and think about whether SLOs are a fit for their organization, or what commonalities these anecdotes have with their own experience.
Thanks for reading,
Eric Mustin