Batching — Physea Wiki

Batch APIs run many requests asynchronously at roughly half the standard price. The trade is patience: results come back within a window of up to 24 hours instead of immediately.

Most API calls are synchronous: you send a request and wait a few seconds for the answer. Batching flips that. You submit a large pile of requests at once, the provider works through them in the background, and you collect the results later. In return for giving up the instant answer, you pay less.

The discount is steep and identical at the two providers cited here. Anthropic’s Message Batches API charges 50% of standard prices on input, output, and special tokens alike.^[1] OpenAI’s Batch API offers the same “50% cost discount compared to synchronous APIs.”^[2] The savings are possible because the provider can schedule the work when it has spare capacity rather than answering on demand.
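To make the arithmetic concrete, here is a minimal sketch of the 50% discount applied to a token bill. The per-million-token prices and the token counts are hypothetical placeholders, not any provider's actual rates, and the function name is illustrative only.

```python
import math

BATCH_DISCOUNT = 0.5  # 50% of standard price, as quoted by both providers


def job_cost(input_tokens: int, output_tokens: int,
             input_price_per_mtok: float, output_price_per_mtok: float,
             batch: bool = False) -> float:
    """Return the cost of a job in the same currency as the prices.

    Prices are per million tokens. The values passed in below are
    hypothetical; substitute the current published rates.
    """
    standard = (input_tokens / 1_000_000) * input_price_per_mtok \
        + (output_tokens / 1_000_000) * output_price_per_mtok
    return standard * BATCH_DISCOUNT if batch else standard


# Hypothetical workload: 10M input tokens, 2M output tokens,
# at assumed prices of 3.00 and 15.00 per million tokens.
sync_cost = job_cost(10_000_000, 2_000_000, 3.0, 15.0)
batch_cost = job_cost(10_000_000, 2_000_000, 3.0, 15.0, batch=True)
print(f"synchronous: {sync_cost:.2f}  batch: {batch_cost:.2f}")
```

Because the discount applies uniformly to input and output tokens, the saving is always half of the synchronous bill regardless of the input/output mix.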

The cost is latency. Both providers quote a completion window of up to 24 hours, though most batches finish well inside it. Anthropic notes that most complete within an hour, and results become retrievable once every request has finished or after 24 hours, whichever comes first.^[1] OpenAI similarly states batches complete “within 24 hours (and often more quickly).”^[2] Batch size is also limited: Anthropic caps a single batch at 100,000 requests or 256 MB, whichever limit is reached first.^[1]
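A workload larger than those caps has to be split across several batches. The sketch below shows one way to do that under stated assumptions: request size is estimated as the length of its UTF-8 JSON encoding, and 256 MB is taken as 256,000,000 bytes. How the provider actually measures batch size may differ, and `split_into_batches` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Any, Iterable, Iterator

MAX_REQUESTS = 100_000      # per-batch request cap quoted above
MAX_BYTES = 256_000_000     # assumption: "256 MB" read as decimal megabytes


def split_into_batches(requests: Iterable[dict[str, Any]],
                       max_requests: int = MAX_REQUESTS,
                       max_bytes: int = MAX_BYTES) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of requests that each stay under both limits."""
    batch: list[dict[str, Any]] = []
    batch_bytes = 0
    for request in requests:
        # Assumption: serialized JSON length approximates the counted size.
        size = len(json.dumps(request).encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single request exceeds the batch size limit")
        # Close the current batch if adding this request would break a cap.
        if batch and (len(batch) >= max_requests or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(request)
        batch_bytes += size
    if batch:
        yield batch


# Usage with deliberately tiny limits so the split is visible.
demo = [{"custom_id": str(i), "prompt": "x"} for i in range(5)]
for chunk in split_into_batches(demo, max_requests=2):
    print([r["custom_id"] for r in chunk])
```

Leaving some headroom below the byte limit is a reasonable precaution, since the estimate here is only an approximation of what the provider counts.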

That makes batching a good fit for work that does not need an immediate reply: running evaluations, classifying or summarizing large datasets, embedding a document collection, or moderating a backlog of content.^[1]^[2] It is the wrong tool for anything a person is waiting on. For interactive work, the cost lever to reach for instead is prompt caching.

References

[1] Batch processing, Anthropic
[2] Batch, OpenAI