Claude Code Is Not an SRE Agent

Why reading logs at I/O speed is still not the same as understanding production.
Everyone wants the same story right now: if AI can write code, surely the next step is that it can run the systems that code creates. Anthropic’s own reliability team just offered a much more useful reality check. At QCon London, Alex Palcuie described Claude as genuinely helpful during incidents, but still a poor substitute for an SRE. Anthropic’s Site Reliability Agent cookbook quietly says the same thing in architectural form: once you move from demo to production, you do not just need a model. You need scoped tools, safety boundaries, runbooks, knowledge capture, and human approval around the model.
That distinction matters because the market is drifting toward the wrong abstraction. We keep treating “AI that can code” as if it naturally extends to “AI that can run production.” It does not. Palcuie described incident response as a loop of observe, orient, decide, and act, while explicitly saying AI is fantastic at the observation part. That framing is the whole story. Production incident response is not a log-reading contest. The hardest part is not seeing more data. The hardest part is deciding what the data means.
Observation Is Not Diagnosis
This is where the difference becomes painfully obvious.
Claude is extremely good at observation. In Palcuie’s examples, it could move through evidence at machine speed, query data quickly, and surface patterns that would have taken a human much longer to find. In one incident, it helped distinguish a cluster of HTTP 500s from what turned out to be an abuse or fraud pattern. That is not trivial. At production scale, the ability to read logs “at the speed of I/O” is a real advantage.
But the more important example is the negative one. During KV cache incidents, Claude repeatedly saw rising request volume and concluded that the system needed more servers. The visible symptom was real. The conclusion was wrong. The actual issue was the broken cache, not a simple capacity shortfall. That is the exact failure mode that makes many AI-for-ops demos look more capable than they are: they confuse correlation with causation, then wrap the mistake in a confident, readable explanation. Palcuie made the point even more directly on postmortems: the model can produce a persuasive story while still being bad at identifying the true root causes.
At Hyground, we think this is the line the industry needs to draw much more clearly. AI is already very good at the observation layer. It is good at searching, correlating, summarizing, and narrowing the search space. Root cause analysis is something else. It is hypothesis management under uncertainty. It is deciding which signals are upstream, which ones are downstream, and which ones are just noise. That is why the right product goal is guided investigation, not autonomous certainty. Anthropic’s own cookbook reflects that same logic: the agent is most effective when it can synthesize across metrics, logs, alerts, and configuration, while keeping remediation inside a structured human-in-the-loop workflow.
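The gap between observation and diagnosis can be made concrete. Here is a minimal sketch of hypothesis management under uncertainty, with invented signal names and weights loosely modeled on the KV cache incident above: each signal votes for or against candidate root causes, and the system ranks hypotheses instead of asserting one.

```python
# Illustrative sketch only: signal names, hypotheses, and weights are
# invented for this example, not taken from any real incident tooling.
from collections import defaultdict

# Each observed signal carries votes for (+) or against (-) hypotheses.
EVIDENCE = [
    ("request_volume_up",   {"capacity_shortfall": +1, "cache_broken": +1}),
    ("cache_hit_rate_down", {"cache_broken": +2}),
    ("cpu_headroom_normal", {"capacity_shortfall": -2}),
]

def rank_hypotheses(evidence):
    """Accumulate evidence weights and return hypotheses, best first."""
    scores = defaultdict(int)
    for _signal, votes in evidence:
        for hypothesis, weight in votes.items():
            scores[hypothesis] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# rank_hypotheses(EVIDENCE) -> [("cache_broken", 3), ("capacity_shortfall", -1)]
```

Note how the ranking flips once the cache and CPU signals are weighed in: request volume alone points at capacity, exactly the trap described above, but no single signal carries the diagnosis.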
Root Cause Lives in History
Palcuie’s most important point may be even simpler: models do not know the history of your system.
And in real production environments, history is half the diagnosis. The alert threshold that was relaxed three years ago because of a migration. The service that still depends on an old ownership model no one fully untangled. The config workaround that became permanent. The incident that everyone remembers, but nobody documented properly. None of that lives in today’s logs. None of that is obvious from the current dashboard. Yet all of it shapes the meaning of the current failure. Palcuie said this directly: Claude does not know the history of your system, especially when that system has been around for ten years.
This is why knowledge preservation is not a nice-to-have layer on top of incident AI. It is part of the core system. Anthropic’s own SRE cookbook ends up in exactly that place. The generic agent becomes materially more useful once it can follow runbooks, encode institutional procedures as skills, search prior postmortems, and write new postmortem pages into Confluence. That is not “extra context.” That is the missing operational memory that turns a plausible story into a useful investigation.
It is also why we designed Hyground the way we did. Our platform is built to run inside the customer’s environment, connect to operational systems like Prometheus, Loki, and Kubernetes, and work with the existing toolchain and knowledge base rather than behave like an isolated chatbot. Hyground’s product language already reflects this: natural-language access to infra data, living documentation that reads and writes team knowledge, and a guided assistant that investigates multi-signal events inside the environment where the incident actually happens.
Jevons Does Not Care About Your Demo
The strategic takeaway from Palcuie’s talk is not just that models still struggle with causality. It is that the demand side of operations is likely to grow, not shrink.
He explicitly invoked the Jevons paradox: when technology makes something cheaper, we often end up doing more of it, not less. In software, that means AI makes it easier to write code, so organizations write more code, create more services, increase complexity, and end up with more interesting failures. The result is not a world with less on-call. It is a world where the surface area for incidents expands faster than teams can manually reason about it.
This is the part many AI narratives still miss. AI-assisted development is not just a productivity story. It is also a complexity story. Every gain in generation speed can translate into more services, more dependencies, more deploys, more hidden coupling, and more chances to discover that a system was only “working” because nobody had stressed it in exactly this way before. That is why the market for operational intelligence is getting bigger at the same time AI coding tools are getting better. The tools that accelerate change are also increasing the need for tools that can safely understand change in production.
Do Not Let Scar Tissue Evaporate
Palcuie also raised a concern that every engineering leader should take seriously: skill atrophy.
He said good SREs carry scar tissue. That is exactly the right phrase. Great incident responders are not just people who know where the logs are. They are people who have seen which dashboards lie, which symptoms repeat, which rollbacks are safe, which “obvious fixes” create a second incident twenty minutes later, and which systems only look independent on the architecture diagram. That scar tissue is expensive to build and easy to lose.
The right role for AI is not to replace that scar tissue. It is to preserve it, spread it, and make it more accessible to the rest of the organization. Let the model do the boring but high-volume work: sift logs, compare deploy history, correlate alerts, summarize state, draft the first postmortem. Let humans own judgment, escalation, tradeoffs, and action under risk. Even Anthropic’s own setup separates investigation from remediation and treats the boundary between read-only analysis and write access as a first-class design decision. That is teammate design, not replacement design.
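That boundary between read-only investigation and gated remediation is simple to express in code. A hedged sketch, using hypothetical tool names: investigation tools run freely, anything that mutates production must pass a human approval callback, and unknown tools are denied by default.

```python
# Illustrative sketch of a read/write safety boundary for an incident
# agent. Tool names are hypothetical, not from any real agent framework.
READ_ONLY = {"query_metrics", "search_logs", "diff_deploys"}
MUTATING = {"rollback_deploy", "scale_service", "restart_pod"}

def run_tool(name: str, approve) -> str:
    """Dispatch a tool call through the safety boundary.

    `approve` is a callable (human-in-the-loop gate) that receives the
    tool name and returns True only if a human signs off.
    """
    if name in READ_ONLY:
        return f"ran {name}"                    # no gate: read-only evidence
    if name in MUTATING:
        if approve(name):                       # human approval required
            return f"ran {name} (approved)"
        return f"blocked {name}: not approved"
    return f"blocked {name}: not on allowlist"  # default-deny unknown tools
```

The design choice that matters is the default: a tool that is neither known-safe nor known-dangerous is blocked, so the investigation side can expand without silently widening the remediation side.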
Anthropic’s Architecture Quietly Validates the Right Design
There is a second signal buried in Anthropic’s own materials, and it is probably the most useful one for builders in this space.
Their official Site Reliability Agent is not “just Claude Code.” It is an architecture. It uses MCP-connected tools for metrics, logs, configs, alerts, and deployment history. It scopes access with restricted directories, command allowlists, and validation hooks. It separates investigation from remediation. It supports runbooks and postmortem workflows. It extends into tools like PagerDuty and Confluence. In other words: even Anthropic does not treat operational AI as a model alone. They treat it as a model embedded in a carefully structured operational system.
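To make that scoping concrete, here is an illustrative validation hook, not Anthropic’s actual implementation: a proposed shell command is checked against a command allowlist and a set of restricted directories before the agent may run it.

```python
# Illustrative sketch of command-allowlist validation. The allowed
# commands and directories below are assumptions for the example.
import shlex
from pathlib import PurePosixPath

ALLOWED_COMMANDS = {"kubectl", "grep", "cat", "tail"}
ALLOWED_DIRS = (PurePosixPath("/var/log"), PurePosixPath("/etc/app"))

def validate(command: str) -> tuple[bool, str]:
    """Validation hook: approve or reject a proposed shell command."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        return False, "command not on allowlist"
    for arg in parts[1:]:
        if arg.startswith("/"):                 # path args must stay in scope
            path = PurePosixPath(arg)
            if not any(path.is_relative_to(d) for d in ALLOWED_DIRS):
                return False, f"path outside restricted dirs: {arg}"
    return True, "ok"
```

Under these assumptions, `tail -n 100 /var/log/app.log` passes, `rm -rf /` fails the allowlist, and `cat /etc/shadow` fails the directory check: the hook enforces the read-only scope before the model ever touches a shell.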
That independently validates the direction we believe matters most. The product moat in AI for operations is not “our model is magical.” It is whether the system is connected to the right evidence, whether it can retrieve institutional knowledge, whether it can structure an investigation safely, and whether it can act inside the right approval boundaries. Hyground’s current architecture points in exactly that direction as well: cluster-resident deployment, local data processing, integration into the existing toolchain, living documentation, and guided investigation instead of blind autonomy.
The Real Opportunity
So yes: Claude Code is not an SRE agent.
But that is not an indictment of Claude Code. It is a statement about the category. SRE work is not coding with different inputs. It is evidence-weighting under uncertainty inside systems shaped by years of technical decisions, organizational compromises, and accumulated operational memory. A model that reads faster is helpful. A system that can investigate safely, retrieve history, preserve team knowledge, and collaborate with humans is what operations teams actually need.
The winners in this market will not be the companies promising to fire the SRE team. They will be the ones that make every engineer more effective in the first fifteen minutes of an incident, preserve what the team learns in the fifteen hours after it, and free the best SREs to work on harder reliability problems. Anthropic’s own experience is one of the clearest public signals yet that this is where the industry is actually heading. If your AI can read logs but cannot separate symptoms from causes, retrieve system history, and operate inside safe workflows, you do not have an SRE agent. You have a very fast observer.


