What Are SLI, SLO, and SLA, and Why Are They Important in SRE?

Aspect
Service Level Indicator (SLI)
Service Level Objective (SLO)
Service Level Agreement (SLA)

Definition
SLI: A metric that measures the performance of a service. Examples include latency, error rate, throughput, and availability.
SLO: A target value or range of values for a particular SLI over a specified time period. It represents the goal for the service’s performance.
SLA: A formal contract between a service provider and a customer outlining expected performance standards (often defined by SLOs) and the consequences of not meeting those standards.

Purpose
SLI: To provide a quantifiable measure of some aspect of the service’s performance.
SLO: To set specific, measurable goals for service performance based on SLIs.
SLA: To establish clear expectations and responsibilities between the service provider and the customer, including penalties or compensations for not meeting the agreed performance levels.

Examples
SLI:
– Latency (e.g., 95th percentile response time)
– Error rate (e.g., percentage of failed requests)
– Availability (e.g., uptime percentage)
– Throughput (e.g., requests per second)
SLO:
– 95th percentile latency should be less than 100 ms over the last 30 days
– Error rate should be less than 0.1% over the last 7 days
– Availability should be at least 99.9% over the last month
SLA:
– The service will maintain 99.9% availability each month. If availability drops below this threshold, the service provider will credit the customer 10% of their monthly fee for each 0.1 percentage point below the threshold, up to a maximum of 50%.

Who Defines It
SLI: Engineers and service owners who monitor and manage the service.
SLO: Engineers and service owners in collaboration with business stakeholders to ensure the objectives meet business needs and are achievable.
SLA: Business leaders, legal teams, and sometimes customers, often with input from engineers and service owners to ensure technical feasibility.

Scope
SLI: Specific aspects of service performance, typically focused on technical metrics.
SLO: Broader than SLIs. Each SLO sets a goal for an SLI, and a service’s SLOs together cover multiple SLIs, often with business impact considerations.
SLA: Broad and formal, encompassing agreed-upon performance standards, legal obligations, and financial implications.

Timeframe
SLI: Continuous, providing real-time or near-real-time data on service performance.
SLO: Specific periods (e.g., weekly, monthly) over which the service’s performance is evaluated against the objective.
SLA: Defined contract period, typically monthly or annually, with regular reviews and updates as needed.

Consequences of Not Meeting
SLI: An SLI is a measurement, so it has no target of its own to miss. When an SLI value degrades, it typically triggers alerts for engineers to investigate and resolve issues.
SLO: Not meeting SLOs may lead to internal reviews and action plans to improve service performance.
SLA: Not meeting SLAs typically results in penalties, such as service credits or refunds to the customer, and can damage the service provider’s reputation and customer trust.

Visibility
SLI: Primarily internal, used by the service provider’s engineering teams.
SLO: Internal, with visibility to both engineering teams and business stakeholders.
SLA: External, visible to both the service provider and the customer, often documented in legal contracts.

Example Metrics
SLI:
– Average response time
– Number of errors
– System uptime
– Requests per second
SLO:
– 99.9% of requests should have response times under 200 ms
– Error rate should not exceed 0.1% over a month
– 99.99% uptime
SLA:
– 99.9% availability per month
– 99% of support requests answered within 24 hours
– 95% of incidents resolved within 4 hours
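
To make the relationship between SLIs and SLOs concrete, here is a minimal Python sketch that turns raw request records into two SLI values and then checks them against SLO limits taken from the examples above (95th percentile latency under 100 ms, error rate under 0.1%). The `Request` record, the function names, the nearest-rank percentile method, and the synthetic data are all assumptions made for illustration. A real service would normally pull these values from its monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(requests):
    """Turn raw request records into SLI values (the measurements)."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "p95_latency_ms": percentile([r.latency_ms for r in requests], 95),
        "error_rate_pct": 100 * failed / total,
    }


# SLO limits (the targets): each SLI must stay strictly below its limit.
SLO_LIMITS = {"p95_latency_ms": 100.0, "error_rate_pct": 0.1}


def evaluate_slos(slis, limits=SLO_LIMITS):
    """Return {sli_name: True/False} for whether each SLO is met."""
    return {name: slis[name] < limit for name, limit in limits.items()}


# Synthetic window of 1,000 requests: 40 slow ones and 2 failures.
sample = (
    [Request(40.0, True)] * 958
    + [Request(250.0, True)] * 40
    + [Request(40.0, False)] * 2
)
slis = compute_slis(sample)
results = evaluate_slos(slis)
print(slis)
print(results)
```

In this synthetic window the latency SLO is met, because fewer than 5% of requests are slow, while the error-rate SLO is missed, because 2 failures in 1,000 requests is 0.2%. The SLIs are just numbers. Only the comparison against the SLO limits says whether the service is doing well enough.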

Summary

SLI: Measures specific aspects of service performance.

SLO: Sets target performance levels for SLIs.

SLA: Formal agreement that includes SLOs and outlines the consequences of not meeting them.
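
The consequences an SLA attaches to a missed target can be worked through with the service-credit example from the comparison above: 99.9% monthly availability, a 10% credit for each 0.1% below the threshold, capped at 50%. The Python sketch below is illustrative only. The function names are hypothetical, and it assumes that any started 0.1% step counts as a full step, which the example clause does not specify. A real contract would spell this out.

```python
import math


def availability_pct(total_minutes, downtime_minutes):
    """Availability SLI: share of the period during which the service was up."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def service_credit_pct(availability, threshold=99.9, step=0.1,
                       credit_per_step=10, max_credit=50):
    """Credit owed, as a percentage of the monthly fee, for a given availability."""
    if availability >= threshold:
        return 0
    shortfall = threshold - availability
    # Round away floating-point noise before counting steps.
    # Assumption: a partially used 0.1% step counts as a full step.
    steps = math.ceil(round(shortfall / step, 9))
    return min(steps * credit_per_step, max_credit)


# A 30-day month has 43,200 minutes. Suppose the service was down for 130 of them.
measured = availability_pct(43_200, 130)
print(f"availability: {measured:.3f}%")
print(f"credit: {service_credit_pct(measured)}% of the monthly fee")
```

With 130 minutes of downtime, availability is about 99.699%, which is just over two 0.1% steps below the threshold, so under the rounding assumption above the credit is 30%. An availability of 97% would exceed the cap and yield the maximum 50% credit.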

Together, SLIs, SLOs, and SLAs let service providers manage and communicate their service performance effectively, keeping it aligned with customer expectations and business objectives.