Author:: Flavius Dinu
Published:: May 26, 2026
Category:: Tutorials

The Math Behind Why Your Multi-Step AI Agentic Workflow Fails in Production

TL;DR:

Agent reliability compounds: 95% per-step accuracy gives only ~60% success over 10 steps and ~36% over 20 steps
Demos usually hide this because they only show 2 or 3 steps. Production workflows usually run 5+ steps over messy inputs and edge cases
The fix is shorter chains, verification between steps, human-in-the-loop for risky actions, and guardrails to reduce the blast radius
Lens Agents reduces the blast radius significantly by enforcing identity, policy, and audit at the platform layer

The simple math for reliability

Lusser’s law in reliability engineering is pretty straightforward. The reliability of a series of components is equal to the product of their individual reliabilities. For example, if each step of an agent’s workflow is independent and succeeds with a probability p, the probability that an n-step task succeeds end-to-end is p^n.
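Written out, under the same independence assumption, the law looks like this. The symbols R and p_i are introduced here for illustration only: p_i is the success probability of step i, and R is the end-to-end success probability.

```latex
R = \prod_{i=1}^{n} p_i
\qquad \text{and, when every step has the same } p_i = p, \qquad
R = p^{n}
```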

If each step succeeds with a probability of 80% and we run a 3-step workflow, the probability of success is 0.8^3 ≈ 0.51, or roughly 51%.

Let’s explore some more optimistic scenarios:

99% accuracy per step: 5 steps (~95%), 10 steps (~90%), 20 steps (~82%), 50 steps (~61%)
95% accuracy per step: 5 steps (~77%), 10 steps (~60%), 20 steps (~36%), 50 steps (~8%)
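If you want to recompute these scenarios for your own numbers, a minimal Python sketch follows. It assumes independent steps with identical per-step accuracy, exactly as in the formula above. The function names chain_success and reliability_table are made up for this example, and results are rounded to the nearest whole percent.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent steps, each succeeding with probability p, all succeed."""
    return p ** n


def reliability_table(accuracies, step_counts):
    """Map each per-step accuracy to {steps: end-to-end success in whole percent}."""
    return {
        p: {n: round(chain_success(p, n) * 100) for n in step_counts}
        for p in accuracies
    }


if __name__ == "__main__":
    # Same scenarios as in the text: 99% and 95% per step over 5, 10, 20, and 50 steps.
    table = reliability_table([0.99, 0.95], [5, 10, 20, 50])
    for p, row in table.items():
        cells = ", ".join(f"{n} steps (~{pct}%)" for n, pct in row.items())
        print(f"{p:.0%} accuracy per step: {cells}")
```

Swap in your own measured per-step accuracy and chain length to see where your workflow lands.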

A 95% per-step accuracy is a very good scenario in practice, but with a 10-step workflow, your agentic workflow will still fail about 40% of the time. At 50 steps, it succeeds only about 8% of the time: it’s a coin flip that you win only if the coin lands on its edge.

Note: In reality, steps aren’t independent, so the real numbers are often worse than the raw multiplication suggests. An incorrect output at step 2 is fed to step 3, and the error carries into every step that follows.

Why do demos lie?

Demos are optimized for the happy path in which you have clean data, short chains, no rate limits, no ambiguous inputs, and the list goes on. You have three steps, everything works perfectly, leadership is happy, and you start implementing a similar production workflow.

But production is a different kind of beast. If you need, for example, to build a workflow that researches a particular incident and drafts a root cause analysis, your agent will actually go through multiple steps:

Parse the alert payload
Query Prometheus for metrics around when the incident happened
Pull pod logs from the affected namespaces
Understand how those pods were created (by a deployment, statefulset, daemonset, standalone, etc.)
Correlate with recent deployments
Check AlertManager for related firing alerts
Check which fixes were applied (pull recent GitHub commits)
Draft the RCA
Post it to Slack/MS Teams/Confluence
Create a Jira/Azure DevOps/etc. ticket

You will have at least 10 tool calls, and every one of them is a place where your agent can pick the wrong tool, pass the wrong parameters, hallucinate a namespace that doesn’t exist, or misread a metric. Even at a very optimistic 95% per-step accuracy, this workflow fails about 40% of the time.
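To put a 10-step workflow like this one in perspective, you can invert the p^n formula and ask two questions: how accurate does each step need to be to hit a target end-to-end success rate, and how many steps can you afford at a given per-step accuracy? The sketch below assumes independent, equally reliable steps. The 90% target and the function names are illustrative assumptions, not figures from a real system.

```python
import math


def required_step_accuracy(target: float, n_steps: int) -> float:
    """Per-step accuracy p such that p ** n_steps == target (independent steps)."""
    if not 0.0 < target <= 1.0 or n_steps < 1:
        raise ValueError("target must be in (0, 1] and n_steps must be >= 1")
    return target ** (1.0 / n_steps)


def max_steps(target: float, p: float) -> int:
    """Largest n such that p ** n >= target (independent steps)."""
    if not 0.0 < target <= 1.0 or not 0.0 < p < 1.0:
        raise ValueError("target must be in (0, 1] and p must be in (0, 1)")
    # p ** n >= target  <=>  n <= log(target) / log(p), because log(p) is negative.
    return math.floor(math.log(target) / math.log(p))


if __name__ == "__main__":
    # Hypothetical target: a 10-step workflow that should succeed 90% of the time.
    print(f"Needed per-step accuracy: {required_step_accuracy(0.90, 10):.4f}")
    print(f"Steps affordable at 95% per step: {max_steps(0.90, 0.95)}")
    print(f"Steps affordable at 99% per step: {max_steps(0.90, 0.99)}")
```

Under these assumptions, a 90% end-to-end target over 10 steps needs roughly 99% accuracy per step, while at 95% per step only 2 steps fit inside the same target.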

These failures can also be dangerous. The agent can complete the workflow without raising a single error and produce a plausible-looking RCA whose content is wrong. Your engineering teams will then need to debug the debugger.

A hard failure is easy to handle because you know that something went wrong and you can retry the workflow. A soft failure tells you that everything is okay while the output is actually wrong. You might also face context drift, where your agent starts subtly reinterpreting the goal based on whatever is most recent in its context window.

So what can you actually do?

The models are good enough to do these tasks; it’s not about waiting for a newer and more powerful model. There are four essential things you should implement:

Shorten the multi-step workflow: Every step you remove from your workflow is a multiplicative win. Combine steps where possible and use deterministic code for parts that don’t need reasoning.
Verify between steps: Don’t blindly feed the output of a crucial step to its successor. Check intermediate results against validation rules.
Use human-in-the-loop: For risky actions, let your agents propose and have a human approve.
Scope the blast radius at the platform level: Even if the agent is wrong, the damage should be bounded by the governance layer in your platform: identity, RBAC, and policies.
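The second and third items in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a framework: Step, run_workflow, StepFailed, the approve callback, and the KNOWN_NAMESPACES set are all hypothetical names invented for this example. The idea is that a deterministic check after each step turns a soft failure (a hallucinated namespace) into a hard one, and that a step marked risky only runs after a human approves its input.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


class StepFailed(Exception):
    """Raised when a step's output fails validation or a human rejects a risky step."""


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]                         # previous output in, new output out
    validate: Optional[Callable[[Any], bool]] = None  # deterministic check on the output
    risky: bool = False                               # needs human approval before it runs


def run_workflow(steps: List[Step], initial: Any,
                 approve: Callable[[str, Any], bool]) -> Any:
    """Run steps in order, stopping at the first rejected or invalid step."""
    data = initial
    for step in steps:
        # Human-in-the-loop: the agent proposes (step name + input), a human decides.
        if step.risky and not approve(step.name, data):
            raise StepFailed(f"{step.name}: rejected by human reviewer")
        result = step.run(data)
        # Verification between steps: never pass an unchecked result downstream.
        if step.validate is not None and not step.validate(result):
            raise StepFailed(f"{step.name}: output failed validation: {result!r}")
        data = result
    return data


# Hypothetical example data: namespaces that really exist in the cluster.
KNOWN_NAMESPACES = {"payments", "checkout"}

EXAMPLE_STEPS = [
    Step("pick_namespace",
         run=lambda alert: alert["namespace"],
         validate=lambda ns: ns in KNOWN_NAMESPACES),
    Step("post_rca",
         run=lambda ns: f"RCA draft for {ns}",
         risky=True),
]

if __name__ == "__main__":
    always_yes = lambda name, data: True  # stand-in for a real approval prompt
    print(run_workflow(EXAMPLE_STEPS, {"namespace": "payments"}, always_yes))
```

Failing fast here is a deliberate choice: a StepFailed exception is a hard failure you can see and retry, which is easier to handle than a wrong result that silently flows into the next step.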

You can’t make your agents 100% reliable. What you can do is limit the blast radius when they are wrong.

Reducing the blast radius with Lens Agents

An AI agent that operates in your cloud account or in your Kubernetes clusters makes the compounding error problem even bigger, and you need something to contain it.

This is where Lens Agents comes in. Lens Agents is a governed platform for running AI agents on enterprise systems. You can’t make your agent perfect, but you can make it contained, as Lens Agents offers:

Identity-bound execution: Every agent has its own identity, and when something goes wrong at the nth step of your workflow, the audit trail tells you exactly who the agent was and what it did
Policy engine: Every agent runs under a declarative policy, and you can restrict what domains they can access and what they can actually do
Credential bindings: The agent never sees raw credentials, as they are injected by the Lens Agent relay only when a request hits an allowed domain
Built-in cost control: You can build spending limits by organization, team, or agent
MCP native: Connect natively to AWS, Kubernetes, and any MCP-compatible tool
Built-in audit trail: Every tool call and every decision path is captured
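To make the ideas of a deny-by-default policy and an audit trail concrete, here is a generic Python sketch. It is not the Lens Agents API or its policy format: the AgentPolicy class, its fields, the action names, and the example hostnames are all assumptions made up for illustration. It only shows the general pattern of checking every request against an allowlist tied to an agent identity and recording each decision.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple
from urllib.parse import urlparse


@dataclass
class AgentPolicy:
    """Hypothetical deny-by-default policy bound to a single agent identity."""
    agent_id: str
    allowed_domains: Set[str]
    allowed_actions: Set[str]
    # Each entry: (agent_id, action, host, allowed)
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def check(self, action: str, url: str) -> bool:
        """Allow only if both the action and the target host are allowlisted."""
        host = urlparse(url).hostname or ""
        allowed = action in self.allowed_actions and host in self.allowed_domains
        # Record every decision, including denials, so failures can be traced later.
        self.audit_log.append((self.agent_id, action, host, allowed))
        return allowed


if __name__ == "__main__":
    policy = AgentPolicy(
        agent_id="rca-agent",
        allowed_domains={"prometheus.internal.example"},
        allowed_actions={"read"},
    )
    print(policy.check("read", "https://prometheus.internal.example/api/v1/query"))
    print(policy.check("delete", "https://prometheus.internal.example/api/v1/query"))
    for entry in policy.audit_log:
        print(entry)
```

With a check like this sitting outside the agent, a wrong decision at any step can only touch what the policy already allows, and the log shows which identity attempted what.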

Conclusion

The compounding error problem is real: at 95% per-step accuracy, your agentic workflow will fail most 20-step tasks (about 64% of them), and no amount of prompt engineering can fix that.

Teams that run agentic workflows successfully treat reliability as a systems problem: they use shorter chains, verification between steps, and hard guardrails at the platform level.

If you want to run AI agents in your production environment without the blast radius keeping you up at night, sign up for Lens Agents early access here.