AI agents
What are AI agents good at, and where do they fail?
Agents shine on open-ended tasks with many small steps. The catch is arithmetic: reliability compounds, so the chance of finishing a long task cleanly drops fast.
Agents shine on open-ended tasks with many small steps, where the path is not known up front and each turn yields feedback to learn from: working through a codebase, researching across many sources, driving a tool with messy inputs.
The catch is arithmetic. Reliability compounds. If each step is right 85% of the time and steps fail independently, ten steps in a row all succeed with probability 0.85¹⁰ ≈ 0.20, or roughly one attempt in five. Capability has been climbing, but reliability over long horizons has lagged: METR’s work measuring the length of tasks an AI can complete finds models far more dependable on short tasks than on long, multi-hour ones; the time horizon they can handle is improving but still bounded.[1] This is the core reason agents are paired with guardrails and human checkpoints rather than turned loose. The workflows and rules pages cover the fixes.
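The compounding arithmetic above can be checked with a short sketch. It assumes every step succeeds with the same probability and that failures are independent, which real agent runs only approximate. The function names `chain_success` and `max_steps` are hypothetical, chosen for illustration.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n_steps steps all succeed.

    Assumes each step succeeds with probability p_step,
    independently of the others (a simplification).
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


def max_steps(p_step: float, target: float) -> int:
    """Longest chain whose overall success rate stays at or above target."""
    if not 0.0 <= p_step < 1.0:
        raise ValueError("p_step must be in [0, 1)")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    prob = 1.0
    # Keep adding steps while the chain still meets the target.
    while prob * p_step >= target:
        prob *= p_step
        steps += 1
    return steps


if __name__ == "__main__":
    # 0.85 per step over ten steps comes out near 0.197, about one in five.
    for p in (0.85, 0.95, 0.99):
        print(f"p={p}: 10 steps -> {chain_success(p, 10):.3f}, "
              f"steps before dropping below 50% -> {max_steps(p, 0.5)}")
```

Under these assumptions, small gains in per-step reliability buy disproportionately longer chains. A human checkpoint that catches and corrects errors works the same way: it splits one long chain into shorter ones, each with a better chance of finishing cleanly.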