DEV Community

Samson Tanimawo

Building the first Agentic SRE Platform. 100 AI agents that detect, investigate, and resolve incidents autonomously.

Houston · https://novaaiops.com

Pronouns: He/Him/His

Monitoring Costs Are Out of Control — Here's How to Fix It

2 min read

Hiring SREs: What I Look For After Interviewing 100+ Candidates

3 min read
Log Management at Scale: How We Cut Costs 70% Without Losing Signal

2 min read
Canary Deployments: The Pattern That Cut Our Rollback Rate by 80%

2 min read
Platform Engineering: Building an Internal Developer Platform That Teams Actually Use

2 min read
Chaos Engineering for Teams That Aren't Netflix

3 min read
Distributed Tracing: The Missing Piece of Your Observability Stack

3 min read
The Golden Signals: A Practical Implementation Guide

2 min read
Kubernetes Observability: What to Monitor and Why

2 min read
On-Call Wellness: Protecting Your Engineers from Burnout

2 min read
Post-Mortem Best Practices That Actually Drive Change

2 min read
Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries

2 min read
Error Budgets in Practice: A No-BS Guide

2 min read
3am Incident Response: What I Learned from 200+ Pages

2 min read
The SRE's Guide to Surviving Tool Sprawl

2 min read
I Reduced Our Alert Volume by 90% — Here's the Playbook

2 min read
Introducing Nova AI Ops: The AI-Native Operating System for SRE Teams

3 min read
Capacity Planning Without ML: The 80/20 Approach

2 min read
The On-Call Schedule Math Nobody Does

2 min read
Chaos Engineering Is Theater Without These Three Things

2 min read
Choosing Your First SLI: A Decision Framework for New SRE Teams

2 min read
Runbook Hygiene: Why Yours Are Lying to You

2 min read
Agent Handoff Contracts: The Missing Piece in Production Agent Systems

3 min read
What Is Multi-Agent SRE? A Practical Introduction

3 min read
The Future of SRE: What the Next 5 Years Look Like

3 min read
Building a Career in SRE: From Junior to Staff

2 min read
Incident Automation: What to Automate, What to Leave to Humans

2 min read
Infrastructure Drift: Detecting and Preventing It

2 min read
The Engineer Who Owns Nothing: A Cautionary Tale

2 min read
Error Budget Policies That Hold Leadership Accountable

2 min read
Dependency Injection for Observability

2 min read
Load Balancer Tuning: Lessons from Production

2 min read
Capacity Planning for Startups

2 min read
How We Handled Our First Major Outage (And Survived)

2 min read
The Economics of Reliability: When to Invest, When to Accept Risk

2 min read
Why Your Status Page Should Be Boring

2 min read
Building Trust with Product Teams as an SRE

2 min read
Incident Command: The Skills They Don't Teach You

2 min read
How AI Is Changing SRE Workflows (Without Replacing SREs)

2 min read
Security Monitoring for SRE Teams

2 min read
Instrumenting Legacy Code Without Rewriting It

2 min read
The Case for a Dedicated Reliability Engineer

2 min read
Runbook-Driven Development: A New Way to Ship

2 min read
Zero-Downtime Database Migrations

2 min read
API Rate Limiting: Patterns That Scale

2 min read
Kubernetes Upgrades Without Downtime

2 min read
The Dashboard Audit: Finding and Killing Dead Metrics

2 min read
Cost Attribution in Shared Infrastructure

2 min read
How We Killed Our Worst Alert (And What We Learned)

2 min read
The Reliability Roadmap: A 90-Day Plan for New SRE Teams

2 min read
Scaling On-Call When You Only Have 5 Engineers

2 min read
TLS Certificate Management Without Tears

2 min read
DNS: The SRE's Most Underrated Skill

2 min read
The Silent Outage: Monitoring What You Can't See

2 min read
Why Every SRE Should Learn a Little Rust

2 min read
How We Built Our Own Incident Management System

2 min read
The Role of Platform Engineering in a Startup

2 min read
Building Dashboards People Actually Use

2 min read
SRE Maturity Models: Where Is Your Team?

2 min read
The Art of Writing a Good Post-Mortem

1 min read