Strengths and failures

Agents shine on open-ended tasks with many small steps. The catch is arithmetic: reliability compounds, so the chance of finishing a long task cleanly drops fast.

Agents shine on open-ended tasks with a lot of small steps, where the path is not known up front and there is feedback to learn from at each turn: working through a codebase, researching across many sources, driving a tool with messy inputs.

The catch is arithmetic: reliability compounds. If each step is right 85% of the time and steps fail independently, all ten steps in a row succeed with probability 0.85¹⁰ ≈ 0.20, roughly one attempt in five. Capability has been climbing, but reliability over long horizons has lagged: METR’s work on measuring how long a task an AI system can complete finds models far more dependable on short tasks than on long, multi-hour ones, with the time horizon they can handle improving but still bounded.^[1] This is the core reason agents are paired with guardrails and human checkpoints rather than turned loose. The workflows and rules pages cover the fixes.
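To make the compounding concrete, here is a minimal Python sketch. It assumes every step succeeds independently with the same probability, which is a simplification: real agent steps are often correlated, and some errors can be recovered from. The function names `chain_success` and `required_per_step` are illustrative, not taken from any library.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` steps succeeds.

    Assumes steps are independent and share the same success rate.
    """
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed for the whole chain to reach `target`."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # How fast an 85% per-step rate decays as the task gets longer.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps at 85% each: {chain_success(0.85, n):.1%}")

    # Working backwards: what each step must achieve for a 90% clean finish.
    rate = required_per_step(0.90, 10)
    print(f"per-step rate for 90% over 10 steps: {rate:.1%}")
```

Run backwards, the same arithmetic shows why checkpoints matter: under these assumptions, finishing ten steps cleanly 90% of the time needs each step to be right about 99% of the time.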

References

[1] METR, "Measuring AI Ability to Complete Long Tasks."

