What are SLAs, SLOs and SLIs? The Key to Building Reliable Software

In today’s digital age, nearly every organization relies on a software service in some capacity. Service disruptions can be disastrous for business, especially when they affect the end-user experience and lead to a negative business impact. Nearly every tech company strives to deliver a seamless and uninterrupted experience for their users and wants to ensure that they provide maximum uptime. But how do they ensure that billions of users can access their services smoothly, even if the underlying infrastructure has ongoing problems?

Imagine if your favorite music streaming application, say Spotify, suddenly stopped working or gave you a degraded experience. If you love listening to music during long commutes, your journey just became a lot more boring. In November 2018, Spotify ended up deleting all of its production Kubernetes clusters while running tests for a new feature. Despite such a disaster, Spotify was able to maintain 99.9% uptime, and users were able to enjoy the streaming service without any major disruptions.

Similar to Spotify, nearly every tech organization has a certain uptime percentage that they are required to achieve every single month. This monthly target is not a “nice to have” goal, but rather it is a key indicator of whether they can deliver on their promises to their end users. In the above story, the promise from Spotify was to provide a streaming service that is available 99.9% of the time.
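To make a target like 99.9% concrete, it helps to convert it into the downtime it permits. The sketch below assumes a 30-day month; the function name `allowed_downtime_minutes` is purely illustrative and does not come from any particular tool.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime that an uptime target permits over `days` days."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * 24 * 60  # assumption: the window is measured in whole days
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month, which shows how little room each extra "nine" gives an operations team.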

These promises are often defined in a document called a Service Level Agreement (SLA). An SLA consists of different Service Level Objectives (SLOs), which are achieved when certain targets are met in Service Level Indicators (SLIs). Many organizations define and track SLAs, SLOs, and SLIs to create a more reliable service for their users.

In this blog, you will learn what SLAs, SLOs, and SLIs are, how they work together to help create a more reliable service, and how companies such as Spotify build systems that stay reliable even in the face of disasters.

SLA: Service Level Agreement

A Service Level Agreement, or SLA, is an agreement between the company and users about the expected reliability of the application or service. This agreement typically sets expectations for service reliability, application latency, availability, responsiveness, etc. If the terms defined within the SLA are not met, consequences that both parties agreed upon before signing the SLA come into effect.

SLAs can directly impact customer satisfaction and company revenue. SLAs are typically written by a business or legal team, and many organizations quote 99% or 99.9% availability within their SLAs. However, before such a number is quoted, its technical feasibility should be checked with the technical teams.

SLAs are typically made between a company (a service vendor) and a paying customer. Services that are a part of a company’s free tier do not require an SLA to be defined.

SLO: Service Level Objective

A Service Level Objective, or SLO, is an agreement made between a company and its users about a specific metric such as latency or uptime. While an SLA is the entire agreement between the two parties, SLOs are the individual promises that together make up that agreement.

The SLO is a specific goal that must be met to stay compliant with the SLA as a whole. An SLO should set the lowest level of reliability that your services can get away with. For example, if your services can function properly at a minimum of 97.3% reliability, that is what your SLO should be set to.

SLOs are useful to define for both paid and free-to-use services. SLOs can help companies improve the overall reliability and quality of their services.

SLI: Service Level Indicator

A Service Level Indicator, or SLI, is a measurable metric used to determine whether the SLO is being met. It is the measured value of the metric defined within the SLO at a given time. For example, if Spotify’s SLO is set at 99.05%, then the SLI must be at least 99.05% to stay compliant with the SLA. To keep services SLA compliant, DevOps engineers and SRE teams need to ensure that the SLI values always meet or exceed the SLO.

Having a good incident response plan in place is a critical step in ensuring that any potential downtime is minimized and you can stay within the set compliance standards. SLIs are essential for monitoring how well a service can adhere to the SLO. Without clearly defined SLIs, it becomes very difficult to accurately measure performance.
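As a rough illustration, an availability SLI can be computed as the share of successful requests and then compared with the SLO target. The request counts below are invented, and the 99.05% target is only the example figure used above, not a published Spotify number.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return availability as the percentage of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return 100.0 * successful_requests / total_requests


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    # Hypothetical traffic for one measurement window.
    sli = availability_sli(successful_requests=991_200, total_requests=1_000_000)
    print(f"SLI: {sli:.2f}%, meets 99.05% SLO: {meets_slo(sli, 99.05)}")
```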

Relation between SLA, SLO & SLI

SLAs, SLOs, and SLIs are all components of the same puzzle. They are used in unison to achieve certain compliance standards for various services. SLAs are used to externally define an agreement between a company and its paid customers. This agreement is usually aimed at achieving certain reliability and availability goals for applications.

SLOs and SLIs are a part of the SLA. An SLO is a key objective that needs to be met to be compliant with the SLA that’s been agreed upon. If an SLO is not met, teams are required to take swift action to ensure that the terms set in the SLA are not broken.

An SLI is a metric used to measure the SLO. These metrics can include measurements such as application latency, uptime, number of successful requests, etc.
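A latency SLI is often expressed as the proportion of requests served faster than some threshold. The sketch below assumes a hypothetical 300 ms threshold and made-up latency samples; a real system would pull these values from its monitoring stack.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the percentage of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)


if __name__ == "__main__":
    # Invented samples, in milliseconds.
    samples = [120, 95, 310, 180, 250, 90, 700, 140, 160, 205]
    share = latency_sli(samples, threshold_ms=300)
    print(f"{share:.1f}% of requests finished within 300 ms")
```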

While the three are interlinked, there are also some differences among SLAs, SLOs, and SLIs. The table below gives a quick overview of how the three compare:

| | SLA | SLO | SLI |
|---|---|---|---|
| What is it? | Guarantee between the company and users promising a level of availability | A level of availability for a service where the user is happy and can use the application | Indicators of system health based on the priorities of users |
| How does it help? | Builds confidence for customers | Prevents SLA breaches and provides an error budget | Highlights the priorities of users for your services |
| Who creates them? | Customer-facing teams such as Sales, in collaboration with Legal teams | DevOps and SRE teams, in collaboration with customer-facing teams to ensure that realistic expectations are set | DevOps and SRE teams, in collaboration with customer-facing teams to ensure that realistic expectations are set |
| Consequences of a breach | Consequences set in the SLA, which may include reimbursement of credits or even legal action | Code freezes and other preventative measures to ensure the SLA is not breached | N/A |
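An error budget is the amount of unreliability that an SLO tolerates. One possible way to track it is sketched below. The 99.5% target, the request counts, and the rule of freezing releases once the budget is fully spent are all assumptions for illustration; real policies vary from team to team.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Tracks how much of an SLO's tolerated failure has been used (illustrative)."""

    slo_percent: float
    total_requests: int
    failed_requests: int

    def __post_init__(self) -> None:
        if not 0 < self.slo_percent < 100:
            raise ValueError("slo_percent must be strictly between 0 and 100")
        if self.total_requests <= 0 or self.failed_requests < 0:
            raise ValueError("request counts must be positive and non-negative")

    @property
    def allowed_failures(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return self.total_requests * (100 - self.slo_percent) / 100

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        return 1 - self.failed_requests / self.allowed_failures

    def should_freeze_releases(self) -> bool:
        """Assumed policy: freeze releases when no budget is left."""
        return self.remaining_fraction <= 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_percent=99.5, total_requests=2_000_000, failed_requests=7_500)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"Freeze releases: {budget.should_freeze_releases()}")
```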


Strategies to achieve SLA/SLO/SLI targets

The entire idea behind setting SLAs, SLOs, and SLIs is to increase the reliability of applications and the underlying infrastructure. There are many ways to make your clusters resilient to failure and maximize reliability. In an ideal world, you would want your services to be 100% reliable. In reality, it is far more realistic to target a close-to-100% reliability score such as 99.5%.

There are many different strategies that companies can adopt to increase the reliability of their services and minimize potential downtime. Some of the common strategies include provisioning and configuring high availability clusters, running across multiple availability zones, ensuring that your applications have sufficient resources allocated to them while remaining cost-efficient, and many more.
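To see why multiple availability zones help, consider a simplified model in which the service stays up as long as at least one zone is up and zones fail independently of each other. Both conditions are assumptions, and the 99% per-zone figure is invented for the example.

```python
def combined_availability(zone_availability_percent: float, zones: int) -> float:
    """Availability of a service that is up if any one of `zones` independent zones is up."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if not 0 <= zone_availability_percent <= 100:
        raise ValueError("zone_availability_percent must be between 0 and 100")
    failure_probability = 1 - zone_availability_percent / 100
    # All zones must fail at the same time for the service to be down.
    return 100 * (1 - failure_probability ** zones)


if __name__ == "__main__":
    for zones in (1, 2, 3):
        print(f"{zones} zone(s) at 99% each: {combined_availability(99.0, zones):.4f}%")
```

Under the independence assumption, two zones at 99% each give about 99.99%. Real failures are often correlated (a bad deployment or a shared dependency can take out every zone at once), so the result is best read as an upper bound.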

Conclusion

SLAs, SLOs, and SLIs are three pieces of the same puzzle. Organizations create and follow these standards to ensure that their applications are reliable. SLAs can even act as legally binding documents to ensure that a service provider or vendor delivers its services as agreed upon. Setting SLAs, backed by well-chosen SLOs and SLIs, helps organizations improve their applications’ performance while also keeping their customers happy.