SLO, SLA, SLI: What They Actually Mean and How to Use Them

Learn what SLO, SLA, and SLI actually mean, how they relate to each other, and how to set targets you can honestly hit.

6 May 2026

Most teams I've worked with can recite these three acronyms but mix them up the moment the pressure is on. In a postmortem, someone will say "we violated our SLA" when they mean the team missed an internal reliability target. In a design review, someone will propose "adding an SLO" as if it were a checkbox. The confusion is not harmless. When the concepts blur, you end up committing to things you can't measure, measuring things that don't matter to customers, and having SLA conversations with legal that are completely disconnected from how the system actually behaves.

So here is a clear account of what each one is, how they relate, and where most teams go wrong.

SLI: The Raw Measurement

SLI stands for Service Level Indicator. It is a quantitative measurement of some aspect of a service's behavior. Nothing more.

A good SLI is a ratio: the count of good events divided by the total count of events. "Good" is defined by the team, specific to the user journey being measured.

Examples:

The fraction of HTTP requests that return a 2xx response in under 500ms
The fraction of background jobs that complete within their expected window
The fraction of API calls that return valid, non-error responses

Bad SLIs tend to be infrastructure metrics: CPU usage, memory consumption, queue depth. These can correlate with user pain, but they don't directly measure it. A server at 95% CPU might be serving requests perfectly. A server at 30% CPU might be silently dropping half of them. Infrastructure metrics are useful for debugging. SLIs are for measuring reliability as the user experiences it.

When defining an SLI, you are answering the question: "How do we know if users are getting a good experience, and how do we express that as a number?"
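To make the ratio concrete, here is a minimal Python sketch of the first example above (2xx responses in under 500ms). The Request type, the is_good rule, and the sample data are all hypothetical names I made up for illustration. In a real system these counts would come from your metrics pipeline, not from an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code
    latency_ms: float    # time taken to serve the request, in milliseconds


def is_good(req: Request, latency_threshold_ms: float = 500.0) -> bool:
    """A request is 'good' if it returned 2xx in under the latency threshold."""
    return 200 <= req.status < 300 and req.latency_ms < latency_threshold_ms


def sli(requests: list[Request]) -> float:
    """Good events divided by total events, as a fraction between 0 and 1."""
    if not requests:
        raise ValueError("SLI is undefined with zero events")
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


# Hypothetical sample: two good requests, one slow, one failed.
sample = [
    Request(200, 120.0),   # good
    Request(200, 740.0),   # 2xx but too slow
    Request(503, 35.0),    # fast but an error
    Request(204, 310.0),   # good
]
print(f"SLI: {sli(sample):.2%}")  # 2 good events out of 4
```

The important design choice is that "good" lives in one small function. When the team changes its mind about what a good experience is, that is the only place that changes.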

SLO: The Target You Set for Yourself

SLO stands for Service Level Objective. It is the target value (or range) you set for an SLI. This is an internal commitment, made by the engineering team to itself and to the business.

If the SLI is "fraction of requests served successfully under 500ms," the SLO might be "99.5% of requests over any 30-day rolling window."

The SLO is where reliability becomes a design constraint. Once you have one, every architectural decision has to contend with it. You can't just add a new synchronous dependency without asking whether it affects your 99.5% target. You can't defer on-call response indefinitely without burning through your error budget.

Error budgets are the practical consequence of SLOs. If your SLO is 99.5% over 30 days, up to 0.5% of requests in that window can fail without breaching the target. That is your error budget. When most of it is unspent, you can ship aggressively. When it is depleted, you slow down, freeze releases, and focus on reliability. This is the mechanism that makes SLOs useful rather than decorative.
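The arithmetic behind an error budget is short enough to sketch. This is an illustration under assumptions, not a standard API: the function names and the request counts are hypothetical, and the freeze rule is the simplest possible reading of "when it is depleted, you freeze releases."

```python
def error_budget(slo: float, total_events: int) -> float:
    """Number of events allowed to fail in the window under the given SLO."""
    return (1.0 - slo) * total_events


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo, total_events)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - bad_events / budget


def should_freeze_releases(remaining: float) -> bool:
    """Simplest possible policy: freeze once the budget is fully spent."""
    return remaining <= 0.0


# Hypothetical window: 1,000,000 requests, 1,250 of them bad, SLO of 99.5%.
remaining = budget_remaining(slo=0.995, total_events=1_000_000, bad_events=1_250)
print(f"budget remaining: {remaining:.0%}, freeze: {should_freeze_releases(remaining)}")
```

With those numbers the budget is roughly 5,000 failed requests, so 1,250 failures spend about a quarter of it. Real policies usually add intermediate thresholds, but the shape of the calculation stays the same.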

Most teams set their first SLO too high. I have seen teams new to SRE practices write down 99.99% because it sounds responsible, with no data to suggest they are anywhere near capable of hitting it. Set your SLO based on what you can actually achieve given current system behavior. Start by measuring your actual SLI over the past 30 days. Set your SLO slightly below that to give yourself a realistic baseline. Then tighten it over time as the system improves; that is where any stretch target belongs.
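Here is a rough sketch of that procedure in Python. The margin of 0.1 percentage points is an assumption of mine, not a rule: "slightly below" is a judgment call for each team. The daily counts are hypothetical as well.

```python
def measured_sli(daily_good: list[int], daily_total: list[int]) -> float:
    """SLI over the whole measurement window, computed from daily counts."""
    if len(daily_good) != len(daily_total):
        raise ValueError("daily_good and daily_total must cover the same days")
    total = sum(daily_total)
    if total == 0:
        raise ValueError("no events in the measurement window")
    return sum(daily_good) / total


def suggest_initial_slo(
    daily_good: list[int], daily_total: list[int], margin: float = 0.001
) -> float:
    """Measured SLI minus a small margin (the margin is an assumed default)."""
    return measured_sli(daily_good, daily_total) - margin


# Hypothetical last 30 days: 10,000 requests per day, 9,950 of them good.
good = [9_950] * 30
total = [10_000] * 30
print(f"measured SLI:  {measured_sli(good, total):.3%}")
print(f"suggested SLO: {suggest_initial_slo(good, total):.3%}")
```

Note that the counts are summed before dividing. Averaging 30 daily percentages would give a quiet Sunday the same weight as a busy Monday.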

Setting an SLO you cannot hit is worse than having no SLO at all. It creates permanent error budget debt and teaches the team to ignore the numbers.

SLA: The External Promise with Consequences

SLA stands for Service Level Agreement. It is a contract with your customers, typically with legal and financial consequences if violated. The target in an SLA is almost always looser than the corresponding SLO.

If your SLO is 99.5%, your SLA might commit to 99%. You need that buffer. The SLA is what you're willing to put money on. The SLO is what your team is working to achieve. You want room between them so that a bad week doesn't immediately turn into a contractual breach.
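To see how much room that buffer gives you, it helps to convert both targets into time. The sketch below assumes a 30-day window and treats unreliability as full-outage minutes, which is a simplification: a request-based SLO does not map cleanly onto minutes unless traffic is roughly uniform. The function name is hypothetical.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def allowed_bad_minutes(target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of total outage the target tolerates within the window."""
    return (1.0 - target) * window_minutes


slo_minutes = allowed_bad_minutes(0.995)    # about 216 minutes
sla_minutes = allowed_bad_minutes(0.99)     # about 432 minutes
buffer_minutes = sla_minutes - slo_minutes  # room between missing the SLO and breaching the SLA

print(f"SLO 99.5%: {slo_minutes:.0f} min")
print(f"SLA 99.0%: {sla_minutes:.0f} min")
print(f"buffer:    {buffer_minutes:.0f} min")
```

Under these assumptions, the team can blow through its entire internal budget and still have about three and a half hours of outage left before the contract is breached.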

SLAs live in the legal domain. SLOs live in the engineering domain. They are connected, but they're different conversations. The mistake I see most often is teams treating their SLO as the SLA, then scrambling when a spike in errors triggers contract clauses.

Some services have no formal SLA at all, just internal SLOs. That is fine, especially for early-stage products. The SLO still matters for the team even without an external contract. It shapes how you prioritize reliability work, how you staff on-call, and when you ship versus when you stop and fix things.

How They Fit Together

The relationship is a chain, with each link built on the previous one: SLI is what you measure, SLO is the target you set on that measurement, SLA is the contractual promise you make to customers based on your SLO.

SLI feeds the SLO. The SLO informs the SLA. If you don't have good SLIs, your SLOs are fiction. If your SLOs are fiction, your SLA commitments are a liability.

In practice, a mature reliability setup looks like this:

You identify the critical user journeys (login, checkout, file upload). For each one, you define one or two SLIs that directly measure whether the journey is working. You set an SLO for each SLI. You track error budgets. When budgets deplete, the team shifts from feature work to reliability work automatically, without needing a manager to make the call. Your SLA sits safely below your SLO with a meaningful buffer.

Where Most Teams Go Wrong

The first failure mode is measuring the wrong thing. Teams pick SLIs that are easy to instrument rather than SLIs that reflect user experience. "Server uptime" is not a useful SLI. A server can be up and still return errors on every request. If you want to see how SLIs connect to concrete system behavior, resilience patterns like circuit breakers and timeouts are a good place to look: they are the mechanisms that protect your error budget when a dependency starts misbehaving.

The second failure mode is aspirational SLOs. Picking 99.99% because a competitor claims that number is not a reliability strategy. It is a number you will ignore within a month because it is permanently broken.

The third failure mode is having SLAs that engineering has never seen. I have worked at companies where the SLA was negotiated by sales, signed by legal, and never communicated to the team responsible for the system. The first time it came up was in a customer escalation. Do not let this happen. The team that builds and operates the service needs to know what is in the contract.

The fourth failure mode is treating SLOs as annual metrics. Reliability is a rolling window problem. A 30-day rolling SLO tells you how the system is performing right now. An annual average can hide a catastrophic month.
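A small simulation shows how an annual number hides a bad month. All the data here is made up: a year of uniform traffic, with one 30-day stretch where the success rate drops to 95%. The helper name rolling_sli is hypothetical.

```python
def rolling_sli(
    daily_good: list[int], daily_total: list[int], window: int = 30
) -> list[float]:
    """SLI for every full window of consecutive days."""
    results = []
    for end in range(window, len(daily_total) + 1):
        good = sum(daily_good[end - window:end])
        total = sum(daily_total[end - window:end])
        results.append(good / total)
    return results


DAYS = 365
daily_total = [100_000] * DAYS   # uniform traffic, for simplicity
daily_good = [99_950] * DAYS     # a normal day: 99.95% success
for day in range(100, 130):      # one catastrophic 30-day stretch at 95%
    daily_good[day] = 95_000

annual = sum(daily_good) / sum(daily_total)
worst_window = min(rolling_sli(daily_good, daily_total))

print(f"annual SLI:           {annual:.3%}")
print(f"worst 30-day window:  {worst_window:.3%}")
```

In this scenario the annual figure still clears a 99.5% target, while the worst 30-day window sits at 95%. Customers lived through the 95% month; the annual report never shows it.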

Choosing Your First SLIs

If you are starting from scratch, don't try to instrument every possible dimension. Pick the two or three things that most directly represent your service working correctly from a user's perspective.

For most web services, this is request success rate and latency. For asynchronous systems, it's job completion rate and processing time. For data pipelines, it's freshness and completeness. Start there. Add more SLIs later as you understand your system better and as you identify reliability gaps that those first metrics miss.

The goal is not comprehensive coverage on day one. The goal is to have a number that goes down when users are having a bad time and goes up when things are working. Everything else is refinement.

A Final Note on Honesty

SLOs are only useful if the team takes them seriously. I've seen teams declare SLO violations and then... do nothing. The error budget was depleted, the chart turned red, and releases continued without any change in behavior. At that point the SLO is purely theatrical.

The mechanism only works if there is a real decision connected to the number. When the error budget hits zero, something has to change. What changes, and who decides, should be written down before the first incident. That conversation is harder than setting the target. But it's the conversation that actually determines whether your reliability practice is real or decorative.

