Benchmarks

LLM Extraction Leaderboard

Field-level accuracy across 21 document types, measured daily against ground-truth test documents. Higher accuracy and lower latency win.

Last updated: March 13, 2026 · Auto-refreshes daily

Overall Rankings

#1 ๐Ÿ† Best Overall

Claude Haiku 4.5

Field Accuracy 96.4%
Avg Latency 1,820 ms
Tasks Covered 21 / 21
#2

GPT-4o Mini

Field Accuracy 93.1%
Avg Latency 2,140 ms
Tasks Covered 21 / 21
#3 โšก Fastest

Gemini 2.0 Flash

Field Accuracy 91.8%
Avg Latency 1,650 ms
Tasks Covered 21 / 21

Per-Task Accuracy

Field-level accuracy (%) per endpoint and model.

Core

| Document Type  | Claude Haiku | GPT-4o Mini | Gemini Flash |
|----------------|--------------|-------------|--------------|
| Invoice        | 98%          | 95%         | 93%          |
| Receipt        | 97%          | 94%         | 92%          |
| Resume         | 95%          | 93%         | 90%          |
| Business Card  | 99%          | 97%         | 96%          |
| Contract       | 94%          | 91%         | 89%          |
| Bank Statement | 96%          | 93%         | 91%          |

Tax

| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---------------|--------------|-------------|--------------|
| W-2           | 98%          | 95%         | 93%          |
| 1099          | 97%          | 94%         | 92%          |
| W-9           | 99%          | 96%         | 95%          |

Real Estate

| Document Type     | Claude Haiku | GPT-4o Mini | Gemini Flash |
|-------------------|--------------|-------------|--------------|
| Lease Agreement   | 95%          | 91%         | 90%          |
| Property Listing  | 97%          | 94%         | 92%          |
| Closing Statement | 94%          | 90%         | 88%          |

Healthcare

| Document Type           | Claude Haiku | GPT-4o Mini | Gemini Flash |
|-------------------------|--------------|-------------|--------------|
| Medical Bill            | 96%          | 92%         | 91%          |
| Explanation of Benefits | 93%          | 90%         | 88%          |
| Prescription            | 98%          | 95%         | 94%          |

Logistics

| Document Type       | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---------------------|--------------|-------------|--------------|
| Bill of Lading      | 96%          | 93%         | 91%          |
| Packing List        | 97%          | 94%         | 92%          |
| Customs Declaration | 95%          | 91%         | 90%          |

HR / General

| Document Type  | Claude Haiku | GPT-4o Mini | Gemini Flash |
|----------------|--------------|-------------|--------------|
| Job Posting    | 97%          | 94%         | 93%          |
| Meeting Notes  | 94%          | 91%         | 89%          |
| Purchase Order | 96%          | 93%         | 92%          |

Methodology

Each model receives an identical prompt built by our production prompt library. Responses are parsed by our production parsers: the same code used in the live API.

Accuracy is measured at the field level: each extracted key-value pair is compared to a human-verified ground truth answer. Comparison is case-insensitive and whitespace-normalized. Numeric values are compared as strings after stripping currency symbols.
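The field-level comparison described above can be sketched as follows. This is a minimal illustration, not the production scoring code; the function names and the set of currency symbols stripped are assumptions.

```python
import re

def fields_match(extracted: str, truth: str) -> bool:
    """Compare one extracted value to its ground-truth answer.

    Case-insensitive, whitespace-normalized, with currency symbols
    stripped so "$1,200.00" and "1,200.00" compare equal.
    (Hypothetical helper for illustration.)
    """
    def normalize(value: str) -> str:
        value = value.strip().lower()
        value = re.sub(r"\s+", " ", value)    # collapse runs of whitespace
        value = re.sub(r"[$€£¥]", "", value)  # strip common currency symbols
        return value

    return normalize(extracted) == normalize(truth)

def field_accuracy(extracted: dict, truth: dict) -> float:
    """Fraction of ground-truth key-value pairs the model got right."""
    correct = sum(
        1 for key, value in truth.items()
        if key in extracted and fields_match(extracted[key], value)
    )
    return correct / len(truth)
```

A missing key simply counts as a miss, which matches scoring at the field level rather than the document level.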

Latency is the wall-clock time from prompt submission to parsed response, averaged over all benchmark documents for the provider. Tests run from a single US-East server to minimize geographic variance.
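A latency measurement of this shape can be sketched like so; `call_model` is a hypothetical callable standing in for the real submit-and-parse round trip.

```python
import time
import statistics

def mean_latency_ms(call_model, documents) -> float:
    """Average wall-clock time from prompt submission to parsed
    response, in milliseconds, over all benchmark documents.

    `call_model` is an assumed interface: it takes a document and
    returns the parsed field dict.
    """
    latencies_ms = []
    for doc in documents:
        start = time.perf_counter()
        call_model(doc)  # submit prompt and parse the response
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(latencies_ms)
```

Using `time.perf_counter()` rather than `time.time()` avoids clock-adjustment artifacts in short measurements.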

Benchmark documents are hand-authored to cover realistic edge cases: missing fields, multiple line items, varied date formats, and non-standard layouts. Documents are held out from prompt development; models have not seen them before.

Today's World API currently uses Claude Haiku 4.5 exclusively. GPT-4o Mini and Gemini Flash results are shown for reference. All results are reproducible via the open-source benchmark runner in our pipeline.

Try the API yourself

Extract structured data from any document in seconds. First 100 calls free โ€” no credit card required.

Try Free →