DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries

2 min read
When Platform Engineering Drifts into Control: Why your internal platform may be killing engineering judgement

6 min read
Ideas

2 min read
Error Budgets in Practice: A No-BS Guide

2 min read
System Design Journey — Week 4: Reliability, Failures & Designing a Payment API

3 min read
3am Incident Response: What I Learned from 200+ Pages

2 min read
Async LLM inference in CI: stop build workers blocking on slow jobs

4 min read
The SRE's Guide to Surviving Tool Sprawl

2 min read
I built an AI incident copilot that does not store your production logs

6 min read
Protective Computing: Software Should Fail Safely Under Stress

5 min read
99.9% uptime is 43 minutes a month. Do you know your number?

4 min read
I Reduced Our Alert Volume by 90% — Here's the Playbook

2 min read
A single probe saying "down" shouldn't wake you at 3am

2 min read
Introducing Nova AI Ops: The AI-Native Operating System for SRE Teams

3 min read
Postmortem: When AI Meets Resilience — AWS Resilience Hub and SRE

10 min read