How to become a Site Reliability Engineer in 2026

On-call ownership for service reliability — incidents, SLOs, and production excellence.

Mid salary (US): $165k
Mid salary (India): ₹38L
Time to ready: 18 months
Hours / week: 12h

What does a Site Reliability Engineer do?

SREs apply software engineering to operations. The day-to-day mixes incident response (during on-call rotations), reliability tooling (writing software to reduce toil), and capacity planning. The 2026 SRE archetype owns SLOs and error budgets, runs game days, and acts as the org's conscience on production reliability. The role requires emotional steadiness: you're the person paged at 3am for a P0. Compensation is high and rising, in part because AI workloads have made production reliability harder (bursty model latency, third-party API dependencies, expensive failures).

A typical day

Lead the incident response for a P1 — the database is taking 5s to respond
Write the postmortem for last week's outage — root cause and 3 follow-up actions
Run a chaos-engineering game day with the backend team
Build a new dashboard that surfaces error-budget burn rate per service
Pair with a developer on a tricky deploy pattern that avoids 5xx spikes

Step-by-step roadmap

3 phases. Plan ~18 months at 12h/week.

Reliability fundamentals

Strong systems fundamentals — Linux internals, networking, container orchestration. SRE is software engineering applied to operations; both halves matter.

~4 mo

Skills to learn

linux, docker, kubernetes

Milestones

Diagnose a Linux performance issue with strace/perf
Run a 3-node Kubernetes cluster and survive a node failure
Read the Google SRE book end-to-end

Observability + incidents

Metrics/logs/traces, SLOs and error budgets, on-call rotation hygiene, and the muscle for incident response.

~4 mo

Skills to learn

monitoring, incident response, distributed systems

Milestones

Define SLOs for one service and set up burn-rate alerts
Lead one game day or chaos-engineering exercise
Write a postmortem that drives a real engineering change
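
To make the burn-rate milestone concrete, here is a minimal sketch of a multi-window burn-rate check. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a value of 1.0 spends the budget exactly over the SLO period. The names `Window`, `burn_rate`, and `should_page` are hypothetical, and the 14.4 threshold is an assumed default taken from a commonly cited long-window/short-window paging setup; tune it to your own SLO period and alerting policy.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    errors: int  # failed requests in the window
    total: int   # all requests in the window


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if window.total == 0:
        return 0.0  # no traffic means no budget was spent
    allowed_error_ratio = 1.0 - slo_target
    return (window.errors / window.total) / allowed_error_ratio


def should_page(long_w: Window, short_w: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window
    shows it is still happening, which lets the alert reset quickly.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


if __name__ == "__main__":
    last_hour = Window(errors=20, total=1000)
    last_5_min = Window(errors=3, total=100)
    print(should_page(last_hour, last_5_min, slo_target=0.999))
```

In practice the counts would come from your metrics backend rather than hard-coded values; the point of the sketch is the ratio and the two-window condition.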

Toil reduction

Building software to reduce operational toil: automation, self-healing systems, capacity planning, and the high-leverage work expected of senior SREs.

~4 mo

Skills to learn

python, go, terraform

Milestones

Automate one recurring operational task — measure hours saved
Ship one tool that other teams adopt
Run capacity planning for one service through one full quarter
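
One simple way to start on the capacity-planning milestone is a straight-line forecast of weekly peak usage. The sketch below fits a least-squares line and estimates how many weeks remain before usage crosses a headroom limit. The function names and the 80% headroom default are assumptions for illustration, and real demand is often seasonal or bursty, so treat the output as a first approximation.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def weeks_until_limit(weekly_peaks: Sequence[float], capacity: float,
                      headroom: float = 0.8) -> Optional[float]:
    """Weeks from the latest sample until the trend hits capacity * headroom.

    Returns None when usage is flat or falling, since the trend line
    never reaches the limit.
    """
    slope, intercept = linear_fit(weekly_peaks)
    if slope <= 0:
        return None
    limit = capacity * headroom
    week_hit = (limit - intercept) / slope
    latest_week = len(weekly_peaks) - 1
    return max(0.0, week_hit - latest_week)


if __name__ == "__main__":
    # Hypothetical peak CPU cores used per week against 200 cores of capacity.
    print(weeks_until_limit([100, 110, 120, 130], capacity=200))
```

Re-running a forecast like this each week through a full quarter, and comparing it with what actually happened, is a reasonable way to learn how far a linear trend can be trusted for a given service.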

Why this role matters in 2026

AI workloads make production reliability harder. SREs who can run incident response for AI-heavy stacks (LLM timeouts, embedding drift, cost runaways) are in short supply.

Hands-on projects

8 curated 2026 projects to build your portfolio.

Advanced ~28h

Custom Observability Pipeline

Build a logs + metrics + traces pipeline using OpenTelemetry, Tempo, Loki, and Grafana. Self-host the whole thing.

OpenTelemetry, Tempo, Loki, Grafana, +1

Intermediate ~24h

Multi-Account Terraform Setup

Set up Terraform for a multi-account AWS or GCP organization. Practice IaC modules, state management, and policy.

Terraform or OpenTofu, AWS or GCP, GitHub Actions, tflint

Intermediate ~14h

SLOs and Error Budgets for a Real Service

Define and implement SLOs + error budgets for an existing service. Drive a real reliability conversation.

Prometheus or Datadog, Grafana, service mesh or proxy for SLI collection
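
For this project, the core arithmetic fits in a few lines. The sketch below assumes a request-based availability SLI (good requests divided by total requests) and reports how much of the error budget is left for the period. `error_budget_report` and its field names are hypothetical; the counts would come from whichever backend you use for SLI collection.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float               # fraction of requests that were good
    allowed_failures: float  # failures the SLO permits for this traffic
    actual_failures: int     # failures actually observed
    budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent


def error_budget_report(good: int, total: int, slo_target: float) -> BudgetReport:
    """Summarize a request-based SLO for one period."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    actual_failures = total - good
    allowed_failures = total * (1.0 - slo_target)
    return BudgetReport(
        sli=good / total,
        allowed_failures=allowed_failures,
        actual_failures=actual_failures,
        budget_remaining=1.0 - actual_failures / allowed_failures,
    )


if __name__ == "__main__":
    # Hypothetical month: 1M requests, 500 failures, 99.9% SLO.
    print(error_budget_report(good=999_500, total=1_000_000, slo_target=0.999))
```

A negative `budget_remaining` is the number that tends to start the reliability conversation: it says the service has already failed more requests than the SLO allows for the period.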

Advanced ~20h

Chaos Engineering Game Day

Plan and run a chaos engineering exercise — what breaks when X fails? Document findings and fix the top 3.

LitmusChaos or Chaos Mesh, Kubernetes, Grafana
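
A game day is easier to document if the steady-state hypothesis is written down before the fault is injected. The sketch below is one possible way to encode that hypothesis and check measurements taken during the experiment against it. `SteadyState`, `percentile`, and `findings` are hypothetical names, not part of LitmusChaos or Chaos Mesh, and the nearest-rank percentile is an assumed choice.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SteadyState:
    """What 'healthy' means for the service under test (assumed thresholds)."""
    max_p99_ms: float
    max_error_ratio: float


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[rank - 1]


def findings(hypothesis: SteadyState, latencies_ms: Sequence[float],
             errors: int, total: int) -> List[str]:
    """Return one line per violated condition; an empty list means it held."""
    out: List[str] = []
    p99 = percentile(latencies_ms, 99)
    if p99 > hypothesis.max_p99_ms:
        out.append(f"p99 latency {p99:.0f}ms exceeded {hypothesis.max_p99_ms:.0f}ms")
    error_ratio = errors / total if total else 0.0
    if error_ratio > hypothesis.max_error_ratio:
        out.append(f"error ratio {error_ratio:.3f} exceeded {hypothesis.max_error_ratio:.3f}")
    return out


if __name__ == "__main__":
    hypothesis = SteadyState(max_p99_ms=80, max_error_ratio=0.01)
    # Hypothetical samples collected while one dependency was killed.
    during_fault = list(range(1, 101))
    for line in findings(hypothesis, during_fault, errors=2, total=100):
        print(line)
```

Each returned line can go straight into the game-day write-up as a finding to rank and fix.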

Related career paths

Roles that share >40% of the same skills — easy lateral moves.

DevOps Engineer

Backend Engineer

Security Engineer
