What is Site Reliability Engineering?
How I Discovered SRE Was the Missing Piece
The SRE Principles That Changed How I Work
1. Automation Is Non-Negotiable
import boto3
def lambda_handler(event, context):
# Get current metrics from CloudWatch
cloudwatch = boto3.client('cloudwatch')
ec2 = boto3.client('ec2')
# Get CPU utilization metrics
response = cloudwatch.get_metric_data(
MetricDataQueries=[
{
'Id': 'cpu',
'MetricStat': {
'Metric': {
'Namespace': 'AWS/EC2',
'MetricName': 'CPUUtilization',
'Dimensions': [
{'Name': 'AutoScalingGroupName', 'Value': 'my-app-asg'},
]
},
'Period': 300,
'Stat': 'Average',
}
},
],
StartTime=context.timestamp - 600, # 10 minutes ago
EndTime=context.timestamp
)
cpu_utilization = response['MetricDataResults'][0]['Values'][0]
# Decision logic
if cpu_utilization > 75:
# Scale up
print(f"CPU at {cpu_utilization}% - Scaling up")
autoscaling = boto3.client('autoscaling')
autoscaling.set_desired_capacity(
AutoScalingGroupName='my-app-asg',
DesiredCapacity=10,
HonorCooldown=True
)
return "Scaled up due to high CPU"
return "No scaling action needed"2. Embracing Service Level Objectives (SLOs)
3. Observability as a Superpower
My GitLab-AWS Integration for SRE Success
Lessons Learned from My SRE Journey with AWS and GitLab
My AWS SRE Toolkit
Closing Thoughts
Last updated