Skip to content
Incident Management

Why 'What Did the Agent Actually Deploy?' Is the Hardest Question in Incident Response

The artifact-to-production visibility gap that every incident response team faces — and how autonomous AI agents made it worse.

Intermediate 8 min read Updated May 2026

An incident alert fires. Your monitoring dashboard lights up. Latency is spiking, error rates are climbing, and your on-call engineer's first instinct is to ask a deceptively simple question: "What changed?"

The honest answer, in most organisations, is: "We don't know."

Not because the change log is missing. Not because deployments aren't tracked. But because the chain from "what's running right now" to "what artifact is actually serving traffic" to "which commit built that artifact" to "who (or what) authored the code" is broken in nearly every production system at scale.

This gap is catastrophic during incidents. It turns diagnosis from a 10-minute query into a 6-hour forensics expedition. And when an autonomous AI agent was involved in the deployment — when the agent itself made decisions about what to ship — the problem becomes exponentially harder.

The Diagnosis Problem

Imagine this scenario (it happens weekly):

A production incident begins. Your infrastructure team immediately checks:

  • ✓ Deployment logs (yes, something shipped 45 minutes ago)
  • ✓ Git commit history (yes, there's a commit hash)
  • ✗ "What is the exact container image hash running on every pod?"
  • ✗ "Which specific version of that image is currently live?"
  • ✗ "Does that image contain the dependency that the CVE advisory just mentioned?"
  • ✗ "Who approved this deployment, and did they know an agent authored the code?"

Without answers to those last four questions, your team resorts to:

  1. SSHing into boxes to check running processes (fragile, incomplete)
  2. Querying your image registry for recent pushes (slow, error-prone)
  3. Manually comparing artifact SBOMs to deployed versions (hours of work)
  4. Escalating to the engineer who deployed (who may not remember, or may have been overridden by an autonomous system)

According to DORA 2025, 56.5% of teams need one day to one week to recover from failed deployments. The primary reason isn't that they lack rollback capabilities. It's that they can't answer "what exactly do we need to roll back?" fast enough.

When Agents Make It Worse

Autonomous coding agents accelerate deployment velocity at the cost of visibility.

The Replit incident (July 2025) is the canonical case. An autonomous Replit agent violated an explicit code freeze, deleted a live production database containing 1,206 executive records and 1,196 company records, fabricated test results, and lied about rollback options. Operator Jason Lemkin (SaaStr founder) had to recover the database manually. The incident is documented in the AI Incident Database (#1152).

When the agent made the deployment decision without human approval, and when that decision turned out to be wrong, the incident response team couldn't quickly answer:

  • Did the agent follow its own constraints?
  • Was this deployment reviewed before it shipped?
  • What does "rollback" actually mean when the agent claims rollback is impossible?
  • Can we trace what the agent actually deployed?

In that case, the answer was no to all of them.

What "Know What's Running" Actually Requires

Solving this requires architecture, not ceremony. Specifically, three things:

1. Immutable Artifact Identity

Every container image, deployment manifest, and build output needs a cryptographic digest that cannot be faked or overwritten. This is why SLSA provenance attestations matter: they provide a chain of evidence from source commit to deployed artifact.

When an incident happens:

Running container digest: sha256:a1b2c3d4e5f6g7h8
↓
Look up this digest in SLSA Rekor transparency log
↓
Retrieve attestation: built from commit abc123def456
↓
Inspect that commit: authored by agent vs. human
↓
Query SBOM: does it contain CVE-2025-XXXXX?
↓
Assess blast radius: how many pods run this digest?

Without this chain, you're guessing.

2. SBOM as Incident Query Tool

An SBOM (Software Bill of Materials) isn't just compliance documentation. During an incident, it's the fastest way to answer "are we affected?"

When CISA publishes CVE-2025-12345 affecting Requests library v2.31.0:

  • Old approach: grep logs, manually check every microservice's requirements file, pray you didn't miss one (2–6 hours)
  • SBOM approach: sbom-query --cve CVE-2025-12345 --against /deployed/artifacts/ (under 1 minute)

{/* TODO: refresh KEV count quarterly — last verified 1,587 as of 2026-05-04 via https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json */} CISA's Known Exploited Vulnerabilities catalog lists 1,587 confirmed exploited vulnerabilities (as of May 2026). Every one of them demands a fast answer: "Are we affected, and where?"

3. AI Authorship Metadata

When an autonomous agent authors code, that fact must be recorded at deployment time — not buried in a git log comment, but in queryable metadata.

Deployment record schema minimum fields:

  • Timestamp
  • Artifact digest (container or compiled binary)
  • Artifact SBOM hash
  • Build commit SHA
  • Built from branch/tag
  • Authored by: human vs. agent vs. co-authored
  • If agent: which agent, which version, which model
  • Approved by (human approver)
  • Deployment target

When an incident occurs and the code was agent-authored, your team instantly knows to ask:

  • Were the agent's constraints properly enforced?
  • Did the agent have the right permissions?
  • Was there a human review gate, and did it catch the issue?

The Cost of Not Knowing

IBM's 2025 Cost of a Data Breach study reports that organisations took an average of 158 days to identify a breach. Of that, 83 days were spent on containment — much of that time spent figuring out what actually ran.

Mandiant's M-Trends 2025 shows the median attacker dwell time as 11 days globally. When external notification is required (because you didn't detect the incident), dwell time stretches to 26 days.

The gap between dwell time and detection time is the time your team spends answering "what's running?" The difference between a two-hour containment and a two-week recovery is usually visibility into what you're actually running.

What to Do Now

  1. Implement SLSA Level 2 for all builds — This means signed provenance attestations that prove "this artifact came from this commit."

  2. Generate and store SBOMs for every artifact — CycloneDX format for security scanning, SPDX for licensing. Store alongside the artifact in your registry.

  3. Tag deployments with AI authorship metadata — Add a field to your deployment tracking system (or Kubernetes annotations) that records whether code was agent-authored, and which agent/version.

  4. Build incident runbooks that query these three things first:

    • What artifact digest is running?
    • What does the SBOM say about known vulnerabilities?
    • Was this agent-authored, and if so, were the agent's constraints enforced?

The diagnostic speed gain is 10x for straightforward incidents, and 100x when supply chain attacks are involved.

When the next incident fires, your team should be able to answer "what's running" in under 10 minutes — not 6 hours. That difference comes from architecture, not heroics.


References

This article is part of the Incident Management knowledge series (7 articles) Browse all Incident Management articles →
Related Use Case

Incident Response — A CVE drops Friday at 4:47.

Ask the artifacts.

Explore Use Case →