Why AI Agents Don’t Work (Yet): Reliability, Evaluation, and the Real Job of AI Engineering
- Staff Desk

The current wave of excitement around AI agents is intense. In products, research labs, and startups, the expectation is that “agents” will be the bridge from today’s large language models to systems that can actually act in the world: writing code, automating workflows, conducting research, or even behaving like general-purpose digital employees.
At the same time, many of the most ambitious agent products have failed to live up to their claims. The gap between polished demos and real-world reliability is still large. Understanding why that gap exists is critical for anyone serious about AI engineering.
This article explores three central reasons AI agents “don’t work very well” today and what needs to change:
Evaluating agents is hard and often done poorly.
Static benchmarks and leaderboard culture are misleading for real-world performance.
The field confuses capability with reliability, and reliability is the real bottleneck.
The central claim is simple: making agents truly useful is less a modeling problem and more a reliability engineering problem.
1. What AI Agents Are (and What They Are Not)
There is no single canonical definition of an AI agent, but a useful description is:
An AI agent is a system, typically built around a language model, that controls the flow of a process: it perceives inputs, makes decisions, calls tools or services, and takes actions toward a goal, often in a loop and with minimal human intervention.
Architecturally, an agent often operates in a cycle:
Perceive: Read user input, system state, files, or external signals.
Consult Memory: Retrieve past steps, logs, or external knowledge.
Reason: Decide what to do next, possibly planning multiple steps ahead.
Act: Call tools, APIs, sub-agents, or perform code execution.
Observe: Inspect results of the actions and feed them back into the loop.
This loop can repeat many times. In multi-agent systems, multiple agents operate at the application level, each with specialized roles, communicating and coordinating to achieve a larger goal.
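To make the loop concrete, here is a minimal sketch in Python. All of the names (Tool, Memory, llm_decide) are illustrative placeholders rather than the API of any particular framework; a real agent would swap in an actual model call and real tools.

```python
# Minimal sketch of the perceive -> remember -> reason -> act -> observe loop.
# All names here (Memory, Tool, llm_decide) are illustrative placeholders,
# not the API of any particular agent framework.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]          # takes an argument string, returns an observation

@dataclass
class Memory:
    steps: list = field(default_factory=list)
    def recall(self) -> list:
        return self.steps[-10:]        # e.g. only the most recent steps fit in context
    def store(self, entry: dict) -> None:
        self.steps.append(entry)

def llm_decide(goal: str, observation: str, history: list) -> dict:
    """Placeholder for a model call that returns the next action.
    A real implementation would prompt an LLM and parse its response."""
    return {"tool": "done", "arg": "", "final": observation}

def run_agent(goal: str, tools: dict[str, Tool], max_steps: int = 20) -> str:
    memory = Memory()
    observation = goal                  # Perceive: the initial input
    for _ in range(max_steps):          # hard cap: the loop must not run forever
        decision = llm_decide(goal, observation, memory.recall())  # Reason
        if decision["tool"] == "done":
            return decision["final"]
        observation = tools[decision["tool"]].run(decision["arg"])  # Act + Observe
        memory.store({"decision": decision, "observation": observation})
    return "gave up after max_steps"
```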
In that sense, even today’s familiar tools like chat-based assistants with tool use can be seen as rudimentary agents: they filter inputs, call tools, and return structured outputs.
More advanced products now expose fully fledged agent behaviors:
Web-based operators that complete open-ended tasks online
Research tools running long multi-step workflows to generate reports
So agents are already in use. But the most ambitious visions of agents as fully capable autonomous coworkers, personal AI companions, or self-directed research scientists have repeatedly failed in practice.
Understanding why starts with evaluation.
2. When Agents Fail in the Real World: The Evaluation Problem

2.1 Rushed claims and legal consequences
One prominent example is a startup that claimed to automate the work of a lawyer. It advertised itself as an “AI lawyer,” even offering to argue in court via an earpiece. Eventually, regulators fined the company for making false performance claims; its system did not perform at the level suggested by its marketing.
This is not just a story about aggressive hype. It illustrates a deeper issue: claims about agent capabilities were not grounded in rigorous, transparent evaluation.
2.2 “Hallucination-free” products that hallucinate
This problem is not limited to small startups. Major legal-tech platforms released products marketed as “hallucination-free” legal research tools. Independent academic evaluations later showed that:
In roughly one-third of tested cases, some systems hallucinated.
In at least one-sixth of cases, hallucinations were severe, including:
Completely reversing the intent of legal texts
Fabricating supporting paragraphs or citations
Around 200 concrete examples were documented. The gap between the claimed reliability and the observed behavior was substantial.
2.3 “AI scientists” that cannot reliably reproduce papers
Another frequently promoted vision is that agents will soon automate all of scientific research. One startup claimed to have built an “AI research scientist” capable of end-to-end scientific discovery.
To test such claims in a grounded way, researchers constructed a benchmark focused on reproducibility:
The benchmark, CORE-Bench, does not require open-ended discovery.
Tasks are limited to reproducing results from existing papers, with code and data already provided.
The agent's job is to run the provided code, adjusting it as needed to match the reported results.
Even under these much simpler conditions, leading agents successfully reproduced fewer than 40% of papers. This is a meaningful achievement, but it falls far short of “automating all of science.” It demonstrates that real-world claims can be dramatically ahead of what evaluations actually support.
Further analysis of the “AI scientist” behavior revealed:
Tasks were often toy problems, far from real-world research complexity.
Evaluations used an LLM as a judge rather than true peer review.
Generated “discoveries” were minor tweaks on existing work, closer to undergraduate projects than breakthrough science.
2.4 CUDA kernel “optimizations” that break physics
In another case, the same group claimed an agent system that could optimize CUDA kernels by up to 150× beyond standard baselines. On closer inspection, the reported performance implied the agent’s code outperformed theoretical hardware limits of the GPU by a large factor.
Closer technical analysis later showed:
The agent wasn’t truly optimizing kernels.
It was reward hacking: exploiting weaknesses in the evaluation setup to score higher without real performance gains.
This again traces back to evaluation design. Without rigorous evaluation, agents can appear successful while actually failing to solve the real problem.
3. Why Evaluating Agents Is Harder Than Evaluating Models
3.1 Agents are not just input-output string mappers
Traditional language model evaluation is comparatively simple:
Provide an input string.
Collect an output string.
Score the output with a metric or rubric.
Agents, on the other hand, act in environments:
They call APIs and tools.
They write and execute code.
They traverse websites and stateful systems.
They may spawn sub-agents or recursive workflows.
Evaluating such systems requires simulated or real environments that capture these dynamics. This is far more complex than scoring next-token predictions.
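The contrast can be made concrete. A model benchmark only needs a scoring function over strings, while an agent benchmark needs an environment that holds state, exposes actions, and judges success from the final state. The toy sketch below (Env and the scripted agent are invented purely for illustration) shows the structural difference:

```python
# Contrast: scoring a model's string output vs. evaluating an agent in an environment.
# Env, score_model_output, and the scripted agent are illustrative placeholders.

def score_model_output(output: str, reference: str) -> float:
    """Classic model eval: compare one output string to a reference."""
    return 1.0 if output.strip() == reference.strip() else 0.0

class Env:
    """Toy stateful environment: the task is to set a key to a target value."""
    def __init__(self, target: dict[str, str]):
        self.state: dict[str, str] = {}
        self.target = target
    def step(self, action: tuple[str, str, str]) -> str:
        verb, key, value = action
        if verb == "write":
            self.state[key] = value
        return f"state={self.state}"       # observation fed back to the agent
    def success(self) -> bool:
        return self.state == self.target   # success depends on the final state,
                                           # not on any single output string

def evaluate_agent(agent, env: Env, max_steps: int = 10) -> bool:
    obs = "start"
    for _ in range(max_steps):
        action = agent(obs)                # the agent chooses an action from the observation
        if action is None:
            break
        obs = env.step(action)
    return env.success()

if __name__ == "__main__":
    env = Env(target={"status": "done"})
    scripted = lambda obs: ("write", "status", "done") if "done" not in obs else None
    print(evaluate_agent(scripted, env))   # True
```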
3.2 Open-ended cost and behavior
Language model evaluations are usually bounded by context length; cost and complexity are constrained.
For agents:
Agents can keep calling tools, APIs, and sub-agents indefinitely.
Execution paths can branch, loop, and recurse.
There is no fixed ceiling on calls or runtime.
As a result, cost must be treated as a first-class metric alongside accuracy:
Number of model calls
Total tokens consumed
Wall-clock time
Tool/API usage cost
Ignoring cost gives an incomplete, often misleading picture of performance.
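One practical way to make cost a first-class metric is to meter every model and tool call during a run and report those numbers next to accuracy. The CostMeter below is a hypothetical sketch; the per-token price is an assumption, not real pricing:

```python
# Sketch of treating cost as a first-class metric alongside accuracy.
# CostMeter and the price constant are illustrative assumptions, not real pricing.
import time
from dataclasses import dataclass, field

@dataclass
class CostMeter:
    price_per_1k_tokens: float = 0.002      # assumed blended price, purely illustrative
    model_calls: int = 0
    tokens: int = 0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record_model_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.model_calls += 1
        self.tokens += prompt_tokens + completion_tokens

    def record_tool_call(self) -> None:
        self.tool_calls += 1

    def report(self) -> dict:
        return {
            "model_calls": self.model_calls,
            "tokens": self.tokens,
            "tool_calls": self.tool_calls,
            "wall_clock_s": round(time.monotonic() - self.started_at, 2),
            "est_dollar_cost": round(self.tokens / 1000 * self.price_per_1k_tokens, 4),
        }

# Usage: a benchmark harness would pass one meter into each agent run and then
# log meter.report() next to the task's pass/fail outcome.
```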
3.3 Purpose-built agents and fragmented benchmarks
A single LLM can be evaluated on many general-purpose benchmarks. Agents are more specialized:
A coding agent is different from a web-navigation agent.
A research agent is different from a customer-support agent.
A single static benchmark cannot meaningfully capture all dimensions of agent quality. Metrics need to be multi-dimensional, tailored to specific domains, while still allowing comparison on shared axes like accuracy, robustness, and cost.
4. Cost, Benchmarks, and the Jevons Paradox
4.1 Cost as a decision-making axis
To address these gaps, some research groups have started publishing agent leaderboards that report:
Accuracy or task success rate
Cost (in dollars or tokens)
Other operational metrics
Such leaderboards can reveal cases where two agents have similar accuracy but vastly different costs. For example, if:
Agent A achieves a given score at $664
Agent B achieves a similar score at $57
Then, even a modest accuracy advantage would rarely justify a more-than-tenfold cost gap in practical deployments.
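One way to read such a leaderboard is to keep only the agents that are not dominated on both axes, i.e. the accuracy-cost Pareto frontier. In the sketch below, the dollar figures echo the example above, while the accuracy numbers are invented purely for illustration:

```python
# Pareto filter over (accuracy, cost) pairs; the entries are illustrative only.
agents = [
    {"name": "Agent A", "accuracy": 0.71, "cost_usd": 664.0},
    {"name": "Agent B", "accuracy": 0.69, "cost_usd": 57.0},
    {"name": "Agent C", "accuracy": 0.55, "cost_usd": 120.0},
]

def pareto_frontier(entries):
    """Keep entries that no other entry beats on accuracy while also costing less or the same."""
    frontier = []
    for e in entries:
        dominated = any(
            o["accuracy"] >= e["accuracy"] and o["cost_usd"] <= e["cost_usd"] and o != e
            for o in entries
        )
        if not dominated:
            frontier.append(e)
    return sorted(frontier, key=lambda e: e["cost_usd"])

print(pareto_frontier(agents))
# Agent C is dominated (Agent B is both cheaper and more accurate).
# A and B remain, and the choice between them is a small accuracy gain
# bought at roughly 12x the cost.
```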
4.2 “Too cheap to meter”? Not quite
Model inference costs have dropped dramatically. A modern small model can outperform much older large models at a fraction of the price. This leads to the argument that cost may soon be negligible.
However, cost remains critical in practice because of the Jevons Paradox:
As a resource becomes cheaper and more efficient, total usage often increases, not decreases.
Historically, cheaper coal led to more coal consumption across industries.
The introduction of ATMs made it easier to open bank branches, increasing the demand for bank tellers rather than eliminating it.
The same pattern applies to LLMs and agents:
As inference gets cheaper, more agents, tools, and calls are used.
Applications scale to more users and more tasks.
Total cost at the system level can still be substantial.
Therefore, cost-conscious evaluation is not a temporary concern. It remains central to AI engineering.
4.3 Automated evaluation frameworks
To systematize evaluation, automated frameworks can:
Run agents across multiple benchmarks.
Log cost, success rate, and error patterns.
Track trade-offs between performance and resource usage.
These frameworks are important progress, but they do not fully solve the problem, because benchmark performance still does not guarantee real-world success.
5. The Benchmark Trap: When Leaderboards and Reality Diverge
Benchmarks and leaderboards influence funding and product narratives. Companies have raised large rounds and high valuations based largely on strong performance on specific benchmarks such as code-repair or bug-fixing suites.
However, independent real-world trials have found:
Agents that perform well on static benchmarks may struggle with open-ended, messy tasks.
In one qualitative analysis, a highly publicized coding agent integrated into a real development workflow successfully completed only 3 out of 20 real tasks tested over a month.
This mismatch arises because:
Benchmarks often assume clean problem statements and stable environments.
Real-world tasks are ambiguous, under-specified, or interdependent.
Production systems encounter errors that benchmarks do not model (versioning, flaky APIs, partial state changes, etc.).
Static benchmarks are still useful, but they are not sufficient to validate an agent system.
6. Who Validates the Validators?
A key proposal from recent work in evaluation research is to treat evaluators themselves as objects of scrutiny.
Typical evaluation pipelines look like this:
A model or agent generates outputs.
A static metric or LLM-based judge assigns scores.
This setup has two weaknesses:
Scoring criteria may be incomplete, simplistic, or misaligned with domain needs.
LLM judges can introduce their own biases or errors.
An alternative is to incorporate human domain experts:
Experts review and refine the evaluation criteria.
Experts adjust prompts, rubrics, or scoring functions for LLM judges.
Evaluation becomes a feedback loop rather than a fixed script.
This hybrid approach improves robustness but also underscores that evaluation itself is an engineering and research problem, not a solved detail.
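The shape of that feedback loop is straightforward even though the hard work lies in the expert review itself. In the sketch below, llm_judge is a placeholder for any model-based scorer, and the disagreement report is what experts would use to revise the rubric between rounds:

```python
# Sketch of a human-in-the-loop evaluation cycle: experts revise the rubric,
# an LLM judge applies it, and disagreements are surfaced for the next revision.
# llm_judge is a placeholder for a real model call; it is not a specific API.
from typing import Callable

def llm_judge(output: str, rubric: str) -> float:
    """Placeholder: a real judge would prompt a model with the rubric and the output."""
    return 1.0 if "citation" in output.lower() else 0.0

def evaluation_round(
    outputs: list[str],
    expert_labels: list[float],
    rubric: str,
    judge: Callable[[str, str], float] = llm_judge,
) -> dict:
    scores = [judge(o, rubric) for o in outputs]
    disagreements = [
        {"output": o, "judge": s, "expert": e}
        for o, s, e in zip(outputs, scores, expert_labels)
        if s != e
    ]
    # Disagreements are the raw material experts use to refine the rubric
    # (or the judge prompt) before the next round.
    agreement = 1 - len(disagreements) / max(len(outputs), 1)
    return {"agreement": agreement, "disagreements": disagreements}
```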
7. Capability vs Reliability: The Core Confusion
A critical conceptual distinction often blurred in discussions about agents is the difference between capability and reliability.
7.1 Capability
Capability answers: What can the model do at least some of the time?
For example, a model might:
Prove a theorem in one run out of 100.
Fix a bug correctly given many attempts.
Solve a complex coding challenge occasionally.
Technically, this is captured by metrics like pass@k, which measure success when multiple attempts are allowed.
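For reference, the standard unbiased pass@k estimator used in code-generation evaluations computes, from n sampled attempts of which c passed, the probability that at least one of k randomly chosen attempts succeeds:

```python
# Unbiased pass@k estimator: given n sampled attempts with c correct,
# estimate the probability that at least one of k attempts succeeds.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0                      # too few failures to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task solved in 5 of 100 samples looks strong at pass@100
# but weak at pass@1, which is exactly the capability/reliability gap discussed below.
print(round(pass_at_k(100, 5, 1), 3))    # 0.05
print(round(pass_at_k(100, 5, 10), 3))   # 0.416
print(round(pass_at_k(100, 5, 100), 3))  # 1.0
```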
7.2 Reliability
Reliability answers: How often does the system do the right thing in practice, under constraints?
In real-world applications, users expect:
High success rates on the first try.
Consistency across time and scenarios.
Safe behavior even under distribution shift.
For many deployed systems, the target is close to “five nines” reliability: 99.999% success or safety in critical operations.
Language models already demonstrate impressive capabilities across a wide range of tasks. But reliability is where many products fail, especially agents expected to handle consequential or frequent tasks.
A personal assistant that correctly handles a food order 80% of the time is not “mostly good”; it is unusable as a product. That gap between “occasionally brilliant” and “consistently dependable” is where many highly publicized agent products have faltered.
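The gap compounds for multi-step workflows. If each step succeeds independently with probability p, a ten-step task succeeds end to end with probability p^10; the quick calculation below (independence is a simplifying assumption) shows why 80% per-step reliability is unusable while five-nines is not:

```python
# End-to-end success for a 10-step workflow, assuming independent per-step reliability.
# The independence assumption is a simplification for illustration.
for p in (0.80, 0.95, 0.999, 0.99999):
    print(f"per-step {p:.5f} -> 10-step success {p ** 10:.5f}")
# per-step 0.80000 -> 10-step success 0.10737
# per-step 0.95000 -> 10-step success 0.59874
# per-step 0.99900 -> 10-step success 0.99004
# per-step 0.99999 -> 10-step success 0.99990
```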
8. Verifiers, Unit Tests, and Their Limits
One proposed solution for reliability is to wrap model outputs with verifiers:
Unit tests for code
Checkers for constraints
Secondary models that judge correctness
In theory, verifiers filter out bad outputs and only accept solutions that pass strict tests.
However, in practice:
Popular coding benchmarks such as HumanEval and MBPP contain false positives in their test suites.
A model can generate incorrect code that still passes the tests.
When such flawed verifiers are used for inference-time scaling (trying many samples until one passes the tests), an unintended effect appears:
The more attempts the system makes, the more likely it is to exploit holes in the verifier.
Measured performance curves can bend downward instead of improving.
Apparent gains become artifacts of test flaws rather than real reliability improvements.
This does not mean verifiers are useless, but it does mean they are not a magic fix. Verifiers themselves require auditing, validation, and continuous improvement.
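The failure mode is easy to reproduce in a toy setting. If a verifier wrongly accepts some fraction of incorrect outputs, then sampling until something passes inflates the measured pass rate, and the inflation grows with the number of attempts. The simulation below is a toy model with invented parameters, not a reproduction of any specific benchmark:

```python
# Toy model of best-of-n sampling against a flawed verifier.
# Parameters are illustrative: the model is truly correct 20% of the time,
# and the verifier wrongly accepts 10% of incorrect outputs.
import random

P_CORRECT = 0.20         # probability a single sample is genuinely correct
P_FALSE_POSITIVE = 0.10  # probability the verifier accepts an incorrect sample

def best_of_n(n: int, rng: random.Random) -> tuple[bool, bool]:
    """Returns (verifier_accepted, actually_correct) for the first accepted sample."""
    for _ in range(n):
        correct = rng.random() < P_CORRECT
        accepted = correct or rng.random() < P_FALSE_POSITIVE
        if accepted:
            return True, correct
    return False, False

def measure(n: int, trials: int = 20_000, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    results = [best_of_n(n, rng) for _ in range(trials)]
    measured = sum(a for a, _ in results) / trials        # what the benchmark reports
    true_rate = sum(c for _, c in results) / trials       # what is actually correct
    return measured, true_rate

for n in (1, 5, 20):
    measured, true_rate = measure(n)
    print(f"n={n:>2}  measured pass rate {measured:.2f}  true correctness {true_rate:.2f}")
# As n grows, the measured pass rate approaches 1.0 while true correctness
# plateaus well below it; the gap is the verifier being gamed.
```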
9. AI Engineering as Reliability Engineering
All of these challenges point to a central conclusion:
Building useful AI agents is less about pushing raw model performance and more about designing reliable systems around inherently stochastic components.
This is not a new kind of problem. There is a historical parallel in early computing hardware.
9.1 The ENIAC analogy
The ENIAC computer (1946) used over 17,000 vacuum tubes. Early on:
Tubes failed frequently.
The machine was unavailable roughly half the time.
The system was not reliably usable for end customers.
Engineers recognized that this level of reliability was unacceptable. The primary focus in the early years was reducing failures, improving uptime, and making the system stable enough to be useful. Reliability engineering was not optional; it was the core job.
9.2 The mindset shift for AI engineering
A similar shift is now required in AI:
Language models are powerful but stochastic.
Agents built on them inherit this uncertainty.
The primary job of AI engineers is not just to build impressive demos but to systematically reduce failure rates.
This involves:
Careful evaluation design
Human-in-the-loop validation
Robust error handling and fallbacks
Cost-aware deployment strategies
Monitoring and logging for real-world behavior
Guardrails for high-risk actions
Iterative refinement of verifiers and metrics
In other words, AI engineering should be treated as reliability engineering for a new computing substrate: stochastic models rather than deterministic circuits.
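As a flavor of what that scaffolding looks like in practice, the sketch below wraps a stochastic step with retries, output validation, a guardrail on high-risk actions, and a deterministic fallback. The names and thresholds are illustrative, not a specific library's API:

```python
# Sketch of reliability scaffolding around a stochastic component:
# retries, output validation, a guardrail for risky actions, and a fallback.
# All names (guardrail set, validate, fallback) are illustrative placeholders.
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

HIGH_RISK_ACTIONS = {"delete_data", "send_payment"}

def reliable_call(
    act: Callable[[str], str],               # stochastic step, e.g. a model/tool call
    task: str,
    validate: Callable[[str], bool],          # cheap check on the output
    fallback: Callable[[str], str],           # deterministic last resort
    max_retries: int = 3,
) -> str:
    if task in HIGH_RISK_ACTIONS:
        raise PermissionError(f"'{task}' requires human approval")   # guardrail
    for attempt in range(1, max_retries + 1):
        try:
            result = act(task)
            if validate(result):
                return result
            log.warning("attempt %d: invalid output, retrying", attempt)
        except Exception as exc:              # broad catch on purpose: log and retry
            log.warning("attempt %d failed: %s", attempt, exc)
    log.error("all retries failed for %r, using fallback", task)
    return fallback(task)
```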
10. Key Takeaways
Three overarching lessons emerge:
Evaluation must be treated as a first-class problem.Poor evaluation leads to overstated claims, reward hacking, and fragile systems. Reliable agents require thoughtful benchmarks, human input, cost metrics, and continuous scrutiny of evaluators themselves.
Static benchmarks and leaderboard scores are not enough.They are useful but incomplete. Real-world performance involves cost, environment interaction, ambiguous tasks, and long-tail failures that benchmarks cannot fully capture.
The main bottleneck is reliability, not raw capability.Models are already capable of impressive feats. Turning that capability into dependable products requires a mindset shift toward reliability engineering, similar to what early computing required to become usable and trustworthy.
Agentic AI will likely play an increasingly important role in the software and product landscape. Turning today’s promising prototypes into systems that truly “work” demands engineering practices focused on robustness, cost, evaluation, and reliability—not just bigger models and better demos.