Tech With Htunn

SLA vs SLO vs SLI: Understanding the Differences

When I first stepped into the role of a DevSecOps Lead at a rapidly scaling startup, I found myself drowning in acronyms. Among the most confusing (and most important) were SLA, SLO, and SLI. Over the years, I've learned to not only understand these concepts but to leverage them to build more reliable systems and happier engineering teams. Let me share what I've discovered along the way.

The Day Our CEO Called: A Tale of Metrics Gone Wrong

I still remember the day our CEO stormed into our engineering area, waving a customer contract. "We promised 99.9% availability in this SLA," he said, "but the customer says we've been down for hours this month. What's going on?"

That moment crystallized an important lesson: without clear, measurable reliability targets and a way to track them, we were setting ourselves up for failure. This kicked off my deep dive into the world of service level metrics.

SLAs: The Promise You Make to Others

A Service Level Agreement (SLA) is essentially a promise with consequences. Think of it as the reliability warranty you offer to your customers.

In my experience, SLAs have these key characteristics:

  1. They have teeth – When we miss our SLA for our payment processing API, we issue automatic credits to our enterprise customers. Last quarter, SLA violations cost us $37,000 in credits.

  2. They're conservative – We never put anything in an SLA that we're not confident we can deliver. Our internal target is always stricter than what we promise customers.

  3. They're business decisions, not just technical ones – I learned this the hard way when I tried to set a 99.99% uptime SLA without consulting our product and sales teams. They quickly reminded me that such promises have real financial implications!

A real SLA I've worked with looked something like this:

SERVICE LEVEL AGREEMENT FOR PAYMENT API

Availability Guarantee: 99.9% monthly uptime
Calculation Method: (Total minutes in month - Downtime minutes) / Total minutes
Exclusions: Scheduled maintenance windows, force majeure events
Remedy: For each 0.1% below guaranteed availability, customer receives 5% service credit
Maximum Credit: 30% of monthly service fee
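The remedy clause above can be turned into a small calculation. Here is a minimal sketch; the choice to count each *started* 0.1% step as a full step is my assumption, since the clause doesn't say how partial shortfalls round:

```python
import math

def sla_credit_percent(measured_availability: float,
                       guaranteed: float = 99.9,
                       credit_per_step: float = 5.0,
                       step: float = 0.1,
                       max_credit: float = 30.0) -> float:
    """Service credit (% of monthly fee) per the remedy clause above."""
    if measured_availability >= guaranteed:
        return 0.0
    shortfall = guaranteed - measured_availability
    # Assumption: each started 0.1% below the guarantee earns a full 5% credit.
    # Rounding before ceil() avoids floating-point artifacts like 1.0000000000001.
    steps = math.ceil(round(shortfall / step, 9))
    return min(steps * credit_per_step, max_credit)
```

For example, a month at 99.8% availability (0.1% short) would earn a 5% credit, while anything at or below 99.3% hits the 30% cap.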

SLOs: The Promise You Make to Yourself

A Service Level Objective (SLO) is an internal goal we set for our systems. It's the reliability target we aim for as engineers.

The most important lesson I've learned about SLOs is that they need to be stricter than SLAs. Here's why:

  1. SLOs provide a buffer – We set our SLO for our payment API at 99.95%, even though our SLA is 99.9%. This gives us a safety margin.

  2. SLOs drive engineering priorities – When we're close to violating our SLO, feature development takes a backseat to reliability work.

  3. SLOs are more granular – While our customer-facing SLA might be monthly, we track our SLOs on a daily, weekly, and monthly basis.

Here's how I typically structure our SLOs:

Service: User Authentication API
SLO Type: Availability
Target: 99.95% successful responses
Measurement Window: Rolling 30-day period
Alert at: <99.97% (early warning), <99.95% (critical)
Action: If below 99.95%, pause feature deployment until improved
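The SLO policy above can be sketched as a simple classifier over the rolling 30-day SLI (a minimal illustration; the status strings and thresholds mirror the table, not any real tooling):

```python
def slo_status(success_ratio: float,
               target: float = 0.9995,
               warn: float = 0.9997) -> str:
    """Classify a rolling 30-day availability SLI against the SLO above."""
    if success_ratio < target:
        # Below the SLO: the table's action kicks in.
        return "critical: pause feature deployment until improved"
    if success_ratio < warn:
        # Between the early-warning and SLO thresholds.
        return "warning: prioritize reliability work"
    return "ok"
```

In practice a check like this would run inside the alerting pipeline, not as a standalone function, but it makes the two-threshold design explicit.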

SLIs: The Truth You Can't Hide From

A Service Level Indicator (SLI) is the actual measurement of how your service is performing. It's the reality check that tells you whether you're meeting your SLOs and SLAs.

After years of fine-tuning, I've learned that good SLIs share these characteristics:

  1. They reflect user experience – Our initial SLI for our checkout flow only measured API availability, but users were experiencing timeouts during peak hours. We now include latency as a critical SLI.

  2. They're simple to calculate – Complex SLIs are hard to explain and harder to trust. We aim for "good requests / total requests" formulations whenever possible.

  3. They're collected automatically – Manual reporting creates blind spots. All our SLIs feed directly into dashboards and alerting systems.

One of our most useful SLIs is structured like this:

SLI: API Response Success Rate
Definition: (Successful responses with latency < 300ms) / (Total requests)
Data Source: Load balancer logs → Prometheus → Grafana
Collection Frequency: Real-time, aggregated per minute
Visualization: 99th percentile shown on team dashboards
Alerting: Triggers warning at <99.97%, critical at <99.95%
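The SLI definition above reduces to a short "good requests / total requests" computation. Here is a minimal sketch over in-memory log records; in reality this runs over load balancer logs, and treating 2xx/3xx statuses as "successful" is my assumption:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code from the load balancer log
    latency_ms: float  # end-to-end response time in milliseconds

def response_success_rate(requests, latency_budget_ms: float = 300.0) -> float:
    """SLI: (successful responses with latency < 300ms) / (total requests)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    good = sum(1 for r in requests
               if r.status < 400 and r.latency_ms < latency_budget_ms)
    return good / len(requests)
```

Note that a fast 500 and a slow 200 both count against the SLI, which is exactly what makes a latency-conditioned success rate reflect user experience better than availability alone.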

How I Tie Them All Together: A Real-World Example

Let me share how we apply these concepts to our user authentication service:

1. We Start with the Customer Promise (SLA)

Our customer contracts specify that user authentication will be available 99.9% of the time, measured monthly. Missing this means automatic credits.

2. We Set Our Internal Target (SLO)

Given the criticality of authentication, we set our internal SLO at 99.95% availability. This gives us a buffer of 0.05% (about 22 minutes per month) before we breach our SLA.
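That buffer is easy to sanity-check numerically (a minimal sketch; the 30-day month is an assumption to keep the arithmetic round):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget implied by an availability target, in minutes/month."""
    return MINUTES_PER_MONTH * (100.0 - availability_pct) / 100.0

sla_budget = allowed_downtime_minutes(99.9)   # about 43.2 minutes/month
slo_budget = allowed_downtime_minutes(99.95)  # about 21.6 minutes/month
buffer = sla_budget - slo_budget              # about 21.6 minutes of margin
```

So meeting the 99.95% SLO leaves roughly 22 minutes of slack before the 99.9% SLA is breached, which matches the buffer described above.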

3. We Measure Relentlessly (SLI)

Our SLI for authentication is the ratio of successful authentication attempts (returning within 200ms with a valid token) to total authentication attempts. This is measured and aggregated in real-time.

Here's how our monitoring dashboard displays this relationship:

┌──────────────────────────────────────────────────────┐
│ Authentication Service Reliability                   │
│                                                      │
│ Current SLI (30-day): 99.982%                        │
│                                                      │
│ ──────────────────────────────────────────────────   │
│                                                      │
│ SLO Target: 99.95% ────────────────────┐             │
│                                        │             │
│ SLA Commitment: 99.9% ─────────────────┘             │
│                                                      │
└──────────────────────────────────────────────────────┘

Lessons Learned the Hard Way

Over the years, I've made plenty of mistakes with these metrics. Here are three that taught me the most:

1. The Perfect SLI Fallacy

Early in my career, I tried to create the "perfect" SLI that captured every nuance of user experience. The result was an overcomplicated formula that nobody understood or trusted. Now I focus on simple, clear SLIs that directly map to user experience.

2. The Copy-Paste Trap

I once borrowed SLO targets from Google's SRE book without considering our specific context. Setting a 99.999% availability SLO for a non-critical internal tool was overkill and wasted engineering resources. Now we right-size our SLOs to the business impact of each service.

3. The Aggregation Blindness

We used to measure our API availability as an aggregate across all endpoints. This hid serious problems with specific critical endpoints. We now track separate SLOs for our most important API paths.

Practical Implementation Tips

If you're implementing these concepts in your organization, here are some practical tips:

1. Start With User Journeys

Before defining SLIs, map out your critical user journeys. For example, for our e-commerce site, these include:

  • User login

  • Product search

  • Adding to cart

  • Completing checkout

  • Viewing order status

For each journey, ask: "What metric would tell us if this is working well for users?"

2. Choose the Right Measurement Points

We collect SLI data from multiple sources:

  • Load balancer logs for availability and latency

  • Application logs for functional correctness

  • Client-side telemetry for full user experience

3. Build Dashboards That Tell a Story

Our reliability dashboards show:

  • Current SLI performance

  • Historical trends

  • SLO thresholds with time remaining in the measurement window

  • Alerts when we approach SLO boundaries

4. Create an SLO Review Process

Every quarter, we review our SLOs against:

  • Customer feedback and support tickets

  • Business impact of reliability issues

  • Engineering cost of maintaining the current SLOs

From Metrics to Culture

The most powerful realization I've had is that SLAs, SLOs, and SLIs aren't just technical metrics: they shape engineering culture. When we get these right, several positive things happen:

  1. Product and engineering speak the same language – "We can't ship this feature because we're approaching our SLO threshold" becomes a statement everyone understands.

  2. On-call becomes less stressful – Clear SLIs tell us exactly when to act and when a situation isn't critical.

  3. Investments in reliability become quantifiable – "This refactoring will improve our authentication SLI by 0.1%" is a concrete, defensible engineering priority.

Conclusion: The Never-Ending Journey

My journey with service level metrics continues to evolve. The systems we build grow more complex, user expectations increase, and our understanding of what "good reliability" means changes with them.

What hasn't changed is the fundamental value of these three concepts:

  • SLAs define our promises to others

  • SLOs define our promises to ourselves

  • SLIs tell us if we're keeping those promises

These simple but powerful tools have transformed how I approach service reliability. They've helped me turn the abstract concept of "reliability" into concrete actions and decisions that my team and business stakeholders can understand and support.

Whether you're just starting your SRE journey or looking to refine your approach, I hope my experiences help you navigate these concepts more effectively in your organization.
