Benchmarks
LLM Extraction Leaderboard
Field-level accuracy across 21 document types, measured daily against ground-truth test documents. Higher accuracy and lower latency win.
Last updated: March 13, 2026 · Auto-refreshes daily
Overall Rankings
1. Claude Haiku 4.5
2. GPT-4o Mini
3. Gemini 2.0 Flash
Per-Task Accuracy
Field-level accuracy (%) per endpoint and model.
Core
| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---|---|---|---|
| Invoice | 98% | 95% | 93% |
| Receipt | 97% | 94% | 92% |
| Resume | 95% | 93% | 90% |
| Business Card | 99% | 97% | 96% |
| Contract | 94% | 91% | 89% |
| Bank Statement | 96% | 93% | 91% |
Tax
| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---|---|---|---|
| W-2 | 98% | 95% | 93% |
| 1099 | 97% | 94% | 92% |
| W-9 | 99% | 96% | 95% |
Real Estate
| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---|---|---|---|
| Lease Agreement | 95% | 91% | 90% |
| Property Listing | 97% | 94% | 92% |
| Closing Statement | 94% | 90% | 88% |
Healthcare
| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---|---|---|---|
| Medical Bill | 96% | 92% | 91% |
| Explanation of Benefits | 93% | 90% | 88% |
| Prescription | 98% | 95% | 94% |
Logistics
| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---|---|---|---|
| Bill of Lading | 96% | 93% | 91% |
| Packing List | 97% | 94% | 92% |
| Customs Declaration | 95% | 91% | 90% |
HR / General
| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---|---|---|---|
| Job Posting | 97% | 94% | 93% |
| Meeting Notes | 94% | 91% | 89% |
| Purchase Order | 96% | 93% | 92% |
Methodology
Each model receives the identical prompt built by our production prompt library. Responses are parsed by our production parsers, the same code used in the live API.
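In outline, the run is a small per-provider loop where only the model call changes. The sketch below is illustrative only: `build_prompt`, `call_model`, and `parse_response` are hypothetical stand-ins for the production prompt library, the provider SDK call, and the production parser, none of which are shown here.

```python
from typing import Any, Callable, Dict, List

def run_provider(
    documents: List[str],
    build_prompt: Callable[[str], str],               # shared prompt builder (assumed)
    call_model: Callable[[str], str],                 # provider-specific completion call (assumed)
    parse_response: Callable[[str], Dict[str, Any]],  # shared response parser (assumed)
) -> List[Dict[str, Any]]:
    """Run every benchmark document through one provider, using the same
    prompt builder and parser for all providers so only the model differs."""
    extractions = []
    for doc in documents:
        prompt = build_prompt(doc)        # identical prompt text for every model
        raw = call_model(prompt)          # the only provider-specific step
        extractions.append(parse_response(raw))  # same parsing code as the live API
    return extractions
```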
Accuracy is measured at the field level: each extracted key-value pair is compared to a human-verified ground truth answer. Comparison is case-insensitive and whitespace-normalized. Numeric values are compared as strings after stripping currency symbols.
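A minimal sketch of that comparison, assuming a flat key-value dict per document; the currency-symbol set is chosen for illustration, not taken from the production code:

```python
import re

# Assumed currency symbols for illustration; the real normalization may differ.
_CURRENCY = re.compile(r"[$€£¥]")

def normalize_field(value) -> str:
    """Case-insensitive, whitespace-normalized string used for comparison."""
    text = re.sub(r"\s+", " ", str(value)).strip().lower()
    return _CURRENCY.sub("", text)

def field_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth fields the extraction reproduced after normalization."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for key, expected in ground_truth.items()
        if normalize_field(extracted.get(key, "")) == normalize_field(expected)
    )
    return hits / len(ground_truth)
```

Under these rules, an extracted "$1,250.00" matches a ground-truth " 1,250.00 ", while a missing or differently formatted value counts against the model.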
Latency is the wall-clock time from prompt submission to parsed response, averaged over all benchmark documents for the provider. Tests run from a single US-East server to minimize geographic variance.
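A timing sketch under that definition, where `call_and_parse` is a hypothetical wrapper around the request and parser from the loop above:

```python
import time
from statistics import mean

def timed_call(call_and_parse, prompt: str):
    """Wall-clock seconds from prompt submission to parsed response."""
    start = time.perf_counter()
    parsed = call_and_parse(prompt)   # request plus parsing, matching the metric definition
    return parsed, time.perf_counter() - start

def average_latency(call_and_parse, prompts) -> float:
    """Provider latency: mean over all benchmark documents."""
    return mean(timed_call(call_and_parse, p)[1] for p in prompts)
```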
Benchmark documents are hand-authored to cover realistic edge cases: missing fields, multiple line items, varied date formats, and non-standard layouts. Documents are held out from prompt development; models have not seen them before.
Today's World API currently uses Claude Haiku 4.5 exclusively. GPT-4o Mini and Gemini Flash results are shown for reference. All results are reproducible via the open-source benchmark runner in our pipeline.
Try the API yourself
Extract structured data from any document in seconds. First 100 calls free, no credit card required.