The Math Behind Why Your Multi-Step AI Agentic Workflow Fails in Production
TL;DR:
- Agent reliability compounds: 95% per-step accuracy gives only ~60% success over 10 steps and ~36% over 20 steps
- Demos usually hide this because they only show 2 or 3 steps. Production workflows usually run 5+ steps over messy inputs and edge cases
- The fix is shorter chains, verification between steps, human-in-the-loop for risky actions, and guardrails to reduce the blast radius
- Lens Agents reduces the blast radius significantly by enforcing identity, policy, and audit at the platform layer
The simple math for reliability
Lusser’s law in reliability engineering is pretty straightforward. The reliability of a series of components is equal to the product of their individual reliabilities. For example, if each step of an agent’s workflow is independent and succeeds with a probability p, the probability that an n-step task succeeds end-to-end is p^n.
If we have a step that succeeds with probability 80% and we run a 3-step workflow, the probability of success is 0.8^3 = 0.512, or about 51%.
Let’s explore some more optimistic scenarios:
- 99% accuracy per step: 5 steps (~95%), 10 steps (~90%), 20 steps (~82%), 50 steps (~61%)
- 95% accuracy per step: 5 steps (~77%), 10 steps (~60%), 20 steps (~36%), 50 steps (~8%)
A 95% per-step success rate is already a very good scenario in practice, yet a 10-step workflow at that rate fails about 40% of the time, and a 20-step workflow fails nearly two times out of three. At 50 steps it succeeds less than 8% of the time, which is worse than a coin flip: it’s closer to betting that the coin lands on its edge.
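The figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes independent steps with identical success probability; the function names `chain_success` and `max_steps` are illustrative, not from any library.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent steps all succeed, each with probability p."""
    return p ** n


def max_steps(p: float, target: float) -> int:
    """Largest number of steps whose end-to-end success rate is still >= target."""
    if not (0 < p < 1 and 0 < target <= 1):
        raise ValueError("expected 0 < p < 1 and 0 < target <= 1")
    n = 0
    while chain_success(p, n + 1) >= target:
        n += 1
    return n


# Print the end-to-end success rate for each scenario in the list above.
for p in (0.99, 0.95):
    row = ", ".join(f"{n} steps: {chain_success(p, n):.1%}" for n in (5, 10, 20, 50))
    print(f"{p:.0%} per step -> {row}")

# How many 95%-reliable steps fit in a 90% end-to-end budget?
print("steps allowed at 95% per step for a 90% target:", max_steps(0.95, 0.90))
```

Read in reverse, the same formula gives a step budget: under these assumptions, a workflow built from 95%-reliable steps can only be two steps long if you want it to succeed at least 90% of the time.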
Note: In reality, steps aren’t independent, so the numbers are often worse than the raw multiplication suggests. An incorrect output at step 2 is fed to step 3, and the error propagates through every step after it, so one early mistake can invalidate the whole run.
Why do demos lie?
Demos are optimized for the happy path in which you have clean data, short chains, no rate limits, no ambiguous inputs, and the list goes on. You have three steps, everything works perfectly, leadership is happy, and you start implementing a similar production workflow.
But production is a different kind of beast. If you need, for example, to build a workflow that researches a particular incident and drafts a root cause analysis (RCA), your agent will actually go through multiple steps:
- Parse the alert payload
- Query Prometheus for metrics around when the incident happened
- Pull pod logs from the affected namespaces
- Determine how those pods were created (by a deployment, statefulset, daemonset, standalone, etc.)
- Correlate with recent deployments
- Check AlertManager for related firing alerts
- Check what solutions were applied in order to fix it (pull recent GitHub commits)
- Draft the RCA
- Post it to Slack/MS Teams/Confluence
- Create a ticket in Jira, Azure DevOps, or a similar tracker
You will have at least 10 tool calls, and every one of them is a place where your agent can pick the wrong tool, pass the wrong parameters, hallucinate a namespace that doesn’t exist, or misread a metric. Even at a very optimistic 95% per-step accuracy, this workflow fails about 40% of the time.
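The 95% figure assumes every tool call is equally reliable, which is rarely the case. The sketch below runs the same multiplication with uneven steps. The step names mirror the list above, and the per-step rates are invented for illustration; they are assumptions, not measurements.

```python
import math

# Hypothetical per-step success rates for the RCA workflow.
# These numbers are illustrative assumptions, not measured values.
STEP_SUCCESS = {
    "parse_alert": 0.99,
    "query_prometheus": 0.95,
    "pull_pod_logs": 0.95,
    "resolve_pod_owner": 0.93,
    "correlate_deployments": 0.90,
    "check_alertmanager": 0.95,
    "pull_recent_commits": 0.95,
    "draft_rca": 0.90,
    "post_to_chat": 0.99,
    "create_ticket": 0.99,
}


def end_to_end(step_success: dict[str, float]) -> float:
    """Product of the per-step success rates, assuming independent steps."""
    return math.prod(step_success.values())


def weakest_steps(step_success: dict[str, float], count: int = 3) -> list[str]:
    """Names of the least reliable steps, the best candidates to harden first."""
    return sorted(step_success, key=step_success.get)[:count]


print(f"end-to-end success: {end_to_end(STEP_SUCCESS):.1%}")
print("harden first:", weakest_steps(STEP_SUCCESS))
```

Measuring per-step rates like these on your own workflow tells you where verification or deterministic code will pay off most, since the weakest steps dominate the product.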
Not all of these failures are visible, which is what makes them dangerous. The agent can complete the workflow without any error and produce a plausible-looking RCA whose conclusions are wrong. Your engineering teams then have to debug the debugger.
A hard failure is the easy case: you know that something went wrong, and you can retry the workflow. A soft failure reports that everything is okay while the output is actually wrong. On top of that, you might also face context drift, where the agent starts subtly reinterpreting the goal based on whatever is most recent in its context window.
So what can you actually do?
The models are good enough to do these tasks; it’s not about waiting for a newer and more powerful model. There are four essential things you should implement:
- Shorten the multi-step workflow: Every step you remove from your workflow is a multiplicative win. Combine steps where possible and use deterministic code for parts that don’t need reasoning.
- Verify between steps: Don’t blindly feed the output of a crucial step to its successor. Check intermediate results against validation rules.
- Use human-in-the-loop: For risky actions, let your agents propose and have a human approve before anything is executed.
- Scope the blast radius at the platform level: Even if the agent is wrong, the damage should be bounded by the governance layer in your platform: identity, RBAC, and policies.
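As a rough illustration of verifying between steps, the sketch below wraps a step so that its output is only passed on once it satisfies a validation rule. Everything here is hypothetical: `run_verified`, `StepValidationError`, and the namespace check are made-up names for the example, and a real validator would query the cluster rather than a hard-coded set.

```python
from typing import Any, Callable


class StepValidationError(Exception):
    """Raised when a step's output never passes validation."""


def run_verified(
    step: Callable[[Any], Any],
    validate: Callable[[Any], bool],
    payload: Any,
    max_attempts: int = 2,
) -> Any:
    """Run step(payload) and return its output only once validate(output) is True."""
    for _ in range(max_attempts):
        output = step(payload)
        if validate(output):
            return output
    name = getattr(step, "__name__", "step")
    raise StepValidationError(f"{name} failed validation after {max_attempts} attempts")


# Hypothetical validation rule: the agent may only name namespaces that exist.
# In practice this set would be fetched from the cluster, not hard-coded.
KNOWN_NAMESPACES = {"payments", "checkout"}


def namespace_exists(output: dict) -> bool:
    """Reject outputs that reference a namespace the cluster doesn't have."""
    return output.get("namespace") in KNOWN_NAMESPACES
```

Raising an exception when validation keeps failing turns a soft failure into a hard one, which is the kind you can retry or escalate to a human.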
You can’t make your agents 100% reliable. What you can do is limit the blast radius when they are wrong.
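One way to picture a bounded blast radius is a small gate in front of every tool call. The sketch below is a generic illustration, not the API of any particular platform: the tool names, `ProposedAction`, and `guarded_execute` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names; a real allowlist would come from your platform policy.
ALLOWED_TOOLS = {"query_metrics", "read_logs", "restart_deployment"}
RISKY_TOOLS = {"restart_deployment"}


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    target: str


def guarded_execute(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[ProposedAction], None],
) -> str:
    """Run an agent-proposed action only if policy and, for risky tools, a human allow it."""
    if action.tool not in ALLOWED_TOOLS:
        return "blocked"  # outside the allowlist: never runs, whatever the agent believes
    if action.tool in RISKY_TOOLS and not approve(action):
        return "rejected"  # the human reviewer declined the proposal
    run(action)
    return "executed"
```

The agent can still be wrong about what to do, but a wrong proposal for a tool outside the allowlist never runs, and a wrong proposal for a risky tool needs a human to agree with it first.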
Reducing the blast radius with Lens Agents
An AI agent that operates in your cloud account or in your Kubernetes clusters makes the compounding error problem even bigger, and you need something to contain it.
This is where Lens Agents comes in. Lens Agents is a governed platform for running AI agents on enterprise systems. You can’t make your agent perfect, but you can contain it. Lens Agents offers:
- Identity-bound execution: Every agent has its own identity, so when something goes wrong at step n of your workflow, the audit trail tells you exactly which agent it was and what it did
- Policy engine: Every agent runs under a declarative policy, and you can restrict which domains it can access and which actions it can take
- Credential bindings: The agent never sees raw credentials, as they are injected by the Lens Agent relay only when a request hits an allowed domain
- Built-in cost control: You can set spending limits by organization, team, or agent
- MCP native: Connect natively to AWS, Kubernetes, and any MCP-compatible tool
- Built-in audit trail: Every tool call and every decision path is captured
Conclusion
The compounding error problem is real: at 95% per-step accuracy, your agentic workflow will fail roughly two out of three 20-step tasks, and no amount of prompt engineering can fix that.
Teams that run agentic workflows successfully treat reliability as a systems problem: they rely on shorter chains, verification between steps, and hard guardrails at the platform level.
If you want to run AI agents in your production environment without the blast radius keeping you up at night, sign up for Lens Agents early access here.

