What is an SRE?

A Site Reliability Engineer is the person responsible for keeping production systems running at scale, using code, measurement, and automation instead of manual operations. The role was created at Google in 2003 by Ben Treynor Sloss, and the cleanest definition still comes from him.

"SRE is what happens when you ask a software engineer to design an operations team." — Ben Treynor Sloss

What an SRE actually does

Chapter 1 of Google's SRE book names eight areas of responsibility: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. A few practices inside that list define what makes SRE different from operations as it was practiced before.

Service Level Objectives. SREs define what "reliable enough" means in measurable terms: a target latency, an availability percentage, an acceptable error rate. An SLO makes "is the system healthy?" a number anyone can check. The underlying measurement that feeds the SLO is a Service Level Indicator (SLI); a Service Level Agreement (SLA) is the contract that promises an SLO to a customer, with consequences if it is missed.
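The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names `availability_sli` and `meets_slo` and the request counts are hypothetical, and real systems usually derive SLIs from monitoring data over a rolling window.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical window: 1,000,000 requests, 700 of which failed.
sli = availability_sli(good_requests=999_300, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is the measurement, the SLO is the threshold it is compared against, and an SLA would attach contractual consequences to missing that threshold.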

Error budgets. The most original idea in the discipline begins with a claim that sounds wrong: 100% reliability is the wrong target for almost any service. If your SLO is 99.9% availability, your error budget is the remaining 0.1%, about 43 minutes of downtime in a 30-day month. The budget formalizes how much downtime the business is willing to tolerate in exchange for development velocity. While budget remains, releases can move faster and the team can tolerate less testing. Exhaust it and launches freeze while the team focuses on reliability work until the budget refreshes. This single mechanism aligns engineering incentives with reliability without anyone needing to play bad cop.
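The arithmetic behind the 43-minute figure, and the freeze rule that follows from it, can be sketched as below. The 30-day window, the function names, and the simple "freeze when the budget is spent" policy are assumptions made for illustration; real policies vary by team.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def launches_allowed(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """Hypothetical policy: launches continue only while budget remains."""
    return budget_remaining(slo_target, downtime_minutes, window_days) > 0


print(f"99.9% over 30 days: {error_budget_minutes(0.999):.1f} minutes")
print("Launch after 20 min of downtime:", launches_allowed(0.999, 20.0))
print("Launch after 50 min of downtime:", launches_allowed(0.999, 50.0))
```

A target of exactly 100% is rejected because it leaves no budget at all, which mirrors the point that 100% is the wrong target.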

Monitoring. Chapter 6 of the SRE book names the Four Golden Signals as the core symptoms of user-facing problems: latency, traffic, errors, and saturation. The book also sorts valid monitoring output into three categories: alerts that page a human now, tickets that need eventual attention, and logs that nobody needs to read unless something else prompts them to. The job of monitoring is to keep each signal in the right category. An SRE team buried in pages for things that should have been tickets has a monitoring problem before it has a reliability problem.
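As a rough illustration of the four golden signals, the sketch below summarizes one window of request data. The `Request` record, the `golden_signals` function, the use of CPU utilization as the saturation measure, and the sample data are all assumptions for the example; production systems typically get these values from a metrics pipeline rather than raw request lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one window of traffic as the four golden signals.

    cpu_utilization (0.0 to 1.0) stands in for saturation here; that is an
    assumption, since the scarce resource differs between services.
    """
    if not requests:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p99, using integer math for the ceiling of 0.99 * n.
    rank = (99 * len(latencies) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,
    }


# Hypothetical 10-second window: 100 requests, two of them server errors.
sample = [Request(latency_ms=float(i), status=500 if i > 98 else 200)
          for i in range(1, 101)]
print(golden_signals(sample, window_seconds=10, cpu_utilization=0.65))
```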

Emergency response. When something breaks, SREs run the investigation. They define the runbooks, carry the pager, write the postmortems, and own the corrective actions. Google's postmortem culture is explicitly blameless: the goal is to fix the system, not to assign fault to a person. Roughly 70% of outages, Google has found, come from changes to a live system, which is why an SRE will often reach for a rollback before they understand the cause.

Toil reduction. The enabling principle that makes the rest possible: no more than 50% of an SRE's time should go to toil, the manual, repetitive, automatable work that scales linearly with the system and keeps the lights on without moving anything forward. The remaining 50% goes to engineering work that removes future toil: tooling, automation, platform improvements, and the observability that makes future incidents legible in the first place. When a team consistently spends more than half its time on toil, Google's policy is to push some of that work back to the development team rather than let SRE absorb it indefinitely.
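The 50% cap is easy to check once time is tracked by category. The sketch below assumes a hypothetical weekly breakdown of hours and a hand-picked set of categories counted as toil; both the category names and the numbers are invented for illustration.

```python
TOIL_CAP = 0.5  # the 50% ceiling described above


def toil_fraction(hours_by_category: dict, toil_categories: set) -> float:
    """Return the share of tracked hours that went to toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


def over_toil_cap(hours_by_category: dict, toil_categories: set) -> bool:
    """Return True when toil exceeds the cap and work should be pushed back."""
    return toil_fraction(hours_by_category, toil_categories) > TOIL_CAP


# Hypothetical week for one engineer (hours).
week = {"tickets": 12, "manual_deploys": 10, "automation": 14, "design": 4}
toil = {"tickets", "manual_deploys"}
print(f"Toil share: {toil_fraction(week, toil):.0%}, "
      f"over cap: {over_toil_cap(week, toil)}")
```

Which categories count as toil is a judgment call each team has to make; the calculation is only as good as that classification.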

Capacity planning, performance, efficiency. The unglamorous half of the job: forecasting demand, sizing infrastructure, finding the 20% of resources that account for 80% of cost. Less photogenic than incident response, but in most production environments it is where the biggest reliability wins live.
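One way to look for the small set of resources that dominates cost is a simple ranked cumulative sum. The sketch below is an assumption-laden illustration: the resource names, the cost figures, and the `top_cost_drivers` function are hypothetical, and the 80% threshold is just the rule of thumb from the paragraph above.

```python
def top_cost_drivers(cost_by_resource: dict, share_percent: int = 80) -> list:
    """Return the most expensive resources that together reach the share.

    Resources are taken in descending order of cost until their combined
    cost is at least share_percent of the total.
    """
    total = sum(cost_by_resource.values())
    ranked = sorted(cost_by_resource.items(),
                    key=lambda item: item[1], reverse=True)
    drivers, running = [], 0
    for name, cost in ranked:
        # Compare in scaled units to avoid fractional rounding surprises.
        if running * 100 >= share_percent * total:
            break
        drivers.append(name)
        running += cost
    return drivers


# Hypothetical monthly costs per resource.
costs = {"db": 500, "cache": 300, "web": 100, "batch": 60, "logs": 40}
print(top_cost_drivers(costs))
```

In this invented data set, two of the five resources account for 80% of the spend, which is where an efficiency effort would start.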

Notice what is not on this list. SREs do not operate servers by hand, chase tickets, or exist to absorb pain so developers do not have to. The point of the discipline is that operations problems are software problems, and software problems get solved with software.

Where SRE came from

In 2003, Ben Treynor Sloss joined Google with a mandate that sounded simple and turned out to be a whole new discipline: keep production running at planet scale, without growing the operations team linearly with the system. The traditional answer would have been to hire more sysadmins. Sloss did something different. He hired software engineers, gave them an operations problem, and let them solve it the way software engineers solve everything else: with code, measurement, and a deep allergy to manual repetition.

He first articulated the model publicly in 2014. Two years later, in 2016, Google published the playbook as a book, Site Reliability Engineering: How Google Runs Production Systems, and made the full text free online. By then, the team had grown from seven engineers to more than a thousand. The discipline now runs at Airbnb, Dropbox, IBM, LinkedIn, Netflix, Wikimedia, and most large engineering organizations that take production seriously.

Why SRE matters

Organizations adopt SRE because the alternatives stop scaling.

Reliability becomes a measurement. Without an SLO, "is the system healthy?" is decided by whoever has the loudest opinion in the room. With one, the team can disagree about a lot of things, but not about whether customers are getting the experience they were promised.

Dev and ops stop fighting structurally. Without an error budget, every release is a negotiation between developers who want to ship and operators who want stability. With one, both sides see the same number, and the question becomes arithmetic rather than political.

Operational knowledge survives turnover. Postmortems, runbooks, and the bias toward writing things down convert what used to live in three senior engineers' heads into something the rest of the team can read. The senior engineer can take vacation; the team's memory does not leave with them.

SRE vs DevOps

DevOps is a cultural movement. It says that the people who build the software should also operate it, that development and operations should not be separate departments throwing artifacts over a wall, and that fast feedback loops beat ceremony. It is a philosophy of how teams should work.

SRE is a job. It is the concrete, measurable, code-first implementation of that philosophy inside a specific role. The SRE book frames it as "a specific implementation of DevOps with some idiosyncratic extensions." Practitioners have a sharper version: class SRE implements DevOps.

How SRE teams are structured

The same role can sit differently inside different organizations. The patterns that show up most in practice:

Embedded. An SRE, or a small pair, sits inside a software engineering team, owning reliability for that team's services. It is most common in mature engineering orgs where reliability is a shared concern.

Infrastructure or platform. A central SRE team owns shared systems: observability stacks, deployment pipelines, the Kubernetes cluster everything else runs on. They are the platform every product team builds against.

Product or application. A dedicated SRE team owns reliability for a specific product, usually one large enough that the product engineering team cannot credibly own production by itself.

Consulting. SREs advise product teams on reliability practices without owning operations directly. Google's Customer Reliability Engineering (CRE) team applies a similar advisory model to Google Cloud customers.

Most large organizations end up with a hybrid: a central platform team owning shared infrastructure, embedded SREs inside critical product teams, and consulting SREs floating across the rest.

The bottom line

SRE treats reliability as a measurable, budgetable, engineerable property of the system, owned by software engineers who refuse to solve operations problems by hand. The alternative is pager-hero culture and a wall of dashboards.
