Benchmarks

LLM Extraction Leaderboard

Field-level accuracy across 21 document types, measured daily against ground-truth test documents. Higher accuracy and lower latency win.

Last updated: March 13, 2026 · Auto-refreshes daily

Overall Rankings

#1 ๐Ÿ† Best Overall

Claude Haiku 4.5

Field Accuracy 96.4%
Avg Latency 1,820 ms
Tasks Covered 21 / 21
#2

GPT-4o Mini

Field Accuracy 93.1%
Avg Latency 2,140 ms
Tasks Covered 21 / 21
#3 โšก Fastest

Gemini 2.0 Flash

Field Accuracy 91.8%
Avg Latency 1,650 ms
Tasks Covered 21 / 21

Per-Task Accuracy

Field-level accuracy (%) per endpoint and model.

Core

| Document Type  | Claude Haiku | GPT-4o Mini | Gemini Flash |
|----------------|--------------|-------------|--------------|
| Invoice        | 98%          | 95%         | 93%          |
| Receipt        | 97%          | 94%         | 92%          |
| Resume         | 95%          | 93%         | 90%          |
| Business Card  | 99%          | 97%         | 96%          |
| Contract       | 94%          | 91%         | 89%          |
| Bank Statement | 96%          | 93%         | 91%          |

Tax

| Document Type | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---------------|--------------|-------------|--------------|
| W-2           | 98%          | 95%         | 93%          |
| 1099          | 97%          | 94%         | 92%          |
| W-9           | 99%          | 96%         | 95%          |

Real Estate

| Document Type     | Claude Haiku | GPT-4o Mini | Gemini Flash |
|-------------------|--------------|-------------|--------------|
| Lease Agreement   | 95%          | 91%         | 90%          |
| Property Listing  | 97%          | 94%         | 92%          |
| Closing Statement | 94%          | 90%         | 88%          |

Healthcare

| Document Type           | Claude Haiku | GPT-4o Mini | Gemini Flash |
|-------------------------|--------------|-------------|--------------|
| Medical Bill            | 96%          | 92%         | 91%          |
| Explanation of Benefits | 93%          | 90%         | 88%          |
| Prescription            | 98%          | 95%         | 94%          |

Logistics

| Document Type       | Claude Haiku | GPT-4o Mini | Gemini Flash |
|---------------------|--------------|-------------|--------------|
| Bill of Lading      | 96%          | 93%         | 91%          |
| Packing List        | 97%          | 94%         | 92%          |
| Customs Declaration | 95%          | 91%         | 90%          |

HR / General

| Document Type  | Claude Haiku | GPT-4o Mini | Gemini Flash |
|----------------|--------------|-------------|--------------|
| Job Posting    | 97%          | 94%         | 93%          |
| Meeting Notes  | 94%          | 91%         | 89%          |
| Purchase Order | 96%          | 93%         | 92%          |

Methodology

Each model receives an identical prompt built by our production prompt library. Responses are parsed by our production parsers: the same code used in the live API.

Accuracy is measured at the field level: each extracted key-value pair is compared to a human-verified ground truth answer. Comparison is case-insensitive and whitespace-normalized. Numeric values are compared as strings after stripping currency symbols.
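The field-level comparison described above can be sketched as follows. This is a minimal illustration, not the production scoring code; the function names and the set of currency symbols stripped are assumptions.

```python
import re

def fields_match(extracted: str, truth: str) -> bool:
    """Compare one extracted value to its ground-truth answer.

    Case-insensitive, whitespace-normalized, with currency symbols
    stripped so "$1,200.00" and "1,200.00" compare equal.
    (Hypothetical helper for illustration.)
    """
    def normalize(value: str) -> str:
        value = value.strip().lower()
        value = re.sub(r"\s+", " ", value)    # collapse runs of whitespace
        value = re.sub(r"[$€£¥]", "", value)  # strip common currency symbols
        return value

    return normalize(extracted) == normalize(truth)

def field_accuracy(extracted: dict, truth: dict) -> float:
    """Fraction of ground-truth key-value pairs the model got right."""
    correct = sum(
        1 for key, value in truth.items()
        if key in extracted and fields_match(extracted[key], value)
    )
    return correct / len(truth)
```

A missing key simply counts as a miss, which matches scoring at the field level rather than the document level.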

Latency is the wall-clock time from prompt submission to parsed response, averaged over all benchmark documents for the provider. Tests run from a single US-East server to minimize geographic variance.
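A latency measurement of this shape can be sketched like so; `call_model` is a hypothetical callable standing in for the real submit-and-parse round trip.

```python
import time
import statistics

def mean_latency_ms(call_model, documents) -> float:
    """Average wall-clock time from prompt submission to parsed
    response, in milliseconds, over all benchmark documents.

    `call_model` is an assumed interface: it takes a document and
    returns the parsed field dict.
    """
    latencies_ms = []
    for doc in documents:
        start = time.perf_counter()
        call_model(doc)  # submit prompt and parse the response
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(latencies_ms)
```

Using `time.perf_counter()` rather than `time.time()` avoids clock-adjustment artifacts in short measurements.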

Benchmark documents are hand-authored to cover realistic edge cases: missing fields, multiple line items, varied date formats, and non-standard layouts. Documents are held out from prompt development; models have not seen them before.

Today's World API currently uses Claude Haiku 4.5 exclusively. GPT-4o Mini and Gemini Flash results are shown for reference. All results are reproducible via the open-source benchmark runner in our pipeline.

Try the API yourself

Extract structured data from any document in seconds. First 100 calls free โ€” no credit card required.

Try Free →