Iteration Layer

OCR Benchmark: Testing extraction accuracy on real-world documents

Scanned invoices, forms, receipts, tables, and charts feed into extraction, reporting, and automation workflows. How well the document converts to markdown decides everything downstream — garbage markdown in, garbage results out. This benchmark shows how the current pipeline performed on the document set we use for evaluation.

How we measured extraction quality

We ran 41 real workflow files — forms, invoices, scans, tables, charts, and photos — through each OCR pipeline, then had Gemini 2.5 Flash Lite judge every markdown output against the source image for text accuracy, layout, tables, and detail preservation.

Input: 41 real workflow files
Convert: OCR pipeline produces markdown
Judge: Gemini compares output to source image
Score: 0.0–1.0 per file
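
The loop itself is simple. Below is a minimal sketch of its general shape, assuming the conversion and judging steps are supplied as callables; the function names are placeholders, not the actual benchmark harness.

```python
# Minimal sketch of the scoring loop described above; the callables `convert`
# and `judge` are placeholders for the OCR pipeline under test and the
# Gemini 2.5 Flash Lite judge, not the actual benchmark harness.
from pathlib import Path
from typing import Callable

PASS_THRESHOLD = 0.70  # a score of 0.70 or higher counts as a pass

def evaluate(
    name: str,
    files: list[Path],
    convert: Callable[[Path], str],       # source document -> markdown
    judge: Callable[[Path, str], float],  # (source image, markdown) -> score 0.0-1.0
) -> dict:
    scores = [judge(f, convert(f)) for f in files]
    return {
        "model": name,
        "avg_score": round(sum(scores) / len(scores), 2),
        "passed": sum(s >= PASS_THRESHOLD for s in scores),
        "total": len(scores),
    }
```
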
Source files: 41 real workflow inputs across 27 document categories
Models tested: 8 (the current API plus 7 reference models)
Judge: Gemini 2.5 Flash Lite, with the same prompt for every model
Our score: 0.93 (second place, 0.01 behind the top model)

Tested file categories

Account statements
Bank checks
Charts and diagrams
Commercial leases
Credit card statements
Delivery notes
Government ID documents
Earnings reports
Equipment inspection forms
Government and tax-style forms
FUNSD-style forms
Glossaries and memos
Nutrition labels
Patents
Patient intake forms
Pay-in sheets and paystubs
Petition forms
Photo documents, receipts, and tables
Proxy voting documents
Quarterly reports
Real-estate documents
Scanned forms and scanned tables
Shift schedules
Shipping invoices
Slide screenshots
SROIE-style receipts
Plain tables

Results across all models

Same files, same judge, same prompt for every model. Iteration Layer OCR scored 0.93 — second place overall, 0.01 behind the best-scoring model in the suite.

Average score by OCR pipeline

Fixed 0.0 to 1.0 scale. Differences are visible without cropping the axis.

Chandra-OCR-2: 0.94
Iteration Layer OCR: 0.93
Qwen3-VL-Instruct: 0.91
Gemma 4 A4B: 0.89
MiniCPM-o 4.5: 0.88
InternVL3.5: 0.82
GLM-OCR: 0.80
LightOnOCR-2: 0.79
Model · Avg score · Passed · Strengths and weaknesses

Chandra-OCR-2 (5B, Q4 local run) · 0.94 · 41/41
Strength: Best overall text and layout preservation in this suite. Weakness: Occasionally over-formats simple pages where plain markdown is easier to use.

Iteration Layer OCR (current Document to Markdown API) · 0.93 · 41/41
Strength: Strong across mixed business documents, forms, scans, and tables. Weakness: Very dense diagrams and charts can still need a human check.

Qwen3-VL-Instruct (Q4 local run) · 0.91 · 39/41
Strength: Good general-purpose markdown on forms, receipts, and tables. Weakness: Missed several cases where exact detail preservation mattered.

Gemma 4 A4B (MoE with vision budget fix) · 0.89 · 34/41
Strength: Good markdown structure on many document layouts. Weakness: Missed important details in several cases.

MiniCPM-o 4.5 (Q4 local run) · 0.88 · 39/41
Strength: Strong extraction quality on typical document pages. Weakness: Less consistent on edge cases than the top rows.

InternVL3.5 (Q4 local run) · 0.82 · 33/41
Strength: Good formatting on many pages. Weakness: Lower accuracy on financial and form details.

GLM-OCR (0.9B, Q8 local run) · 0.80 · 34/41
Strength: Good for simple text-heavy pages where layout is secondary. Weakness: Lower layout and detail reliability than the top rows.

LightOnOCR-2 (1B, Q8 local run) · 0.79 · 32/41
Strength: Good markdown formatting for simple documents. Weakness: Weaker on dense layouts and missed more cases.

What this means for you

Higher extraction quality means fewer manual checks and more reliable workflow output. Single-provider EU-hosted conversion, extraction, and generation share one API style and one credit pool — keeping the pipeline simple and reliable.

Less human review

Fewer missing rows, changed numbers, or garbled fields means fewer routine documents need a person to inspect the markdown before the workflow continues.

More confidence in automation

Consistent conversion output means invoice, contract, and intake workflows run further before routing exceptions to a human.

Better downstream data

Cleaner markdown means extraction and generation APIs produce better output, whether the result goes to a spreadsheet, a report, or an MCP-connected agent.

Faster document turnaround

Files move from upload to extracted fields, generated reports, or spreadsheet exports with less manual correction between steps.

Fewer broken pipeline steps

Reliable OCR output reduces custom cleanup code between pipeline steps. With conversion, extraction, and generation on one platform, the integration points that remain are simpler too.

More trust in client deliverables

When the source markdown is accurate, the reports, summaries, and spreadsheets generated from it are too. Less time fixing deliverables before they reach the client.

OCR is just the first step

Most teams do not stop at markdown. They extract fields, generate reports, create images, or hand the result to an agent. These workflows show how document-to-markdown conversion connects to the rest of the platform.
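
As a rough sketch of that chain, the function below strings the three stages together. The callables are placeholders standing in for conversion, extraction, and generation, not the platform's documented API.

```python
# Illustrative chaining only: `convert`, `extract`, and `generate` are placeholder
# callables for the conversion, extraction, and generation steps, not the
# platform's documented API.
from pathlib import Path
from typing import Callable

def process_document(
    source: Path,
    convert: Callable[[Path], str],   # document -> markdown (the step benchmarked above)
    extract: Callable[[str], dict],   # markdown -> structured fields
    generate: Callable[[dict], str],  # fields -> report, spreadsheet rows, or agent input
) -> str:
    markdown = convert(source)
    fields = extract(markdown)   # extraction quality depends directly on the markdown
    return generate(fields)
```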

Try with your own data

Upload your own documents and see the results. Conversion, extraction, and generation all share one credit pool.

Frequently asked questions

How was this benchmark run?
We ran the same 41-document OCR evaluation suite against the current Iteration Layer OCR pipeline and one reference run per model family. The files cover forms, invoices, scans, receipts, tables, charts, photos, statements, reports, and similar workflow inputs. This is our evaluation suite, not a universal claim about every possible document.
What did the Gemini judge evaluate?
Gemini 2.5 Flash Lite received the original source image and the extracted markdown output. It scored each result from 0.0 to 1.0 using the same prompt for every pipeline, checking completeness, text accuracy, document structure, and whether the output added text that was not present in the image. A score of 0.70 or higher counted as a pass.
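
For readers who want the mechanics, a judge call along these lines can be made with the google-genai SDK. The prompt wording and the plain-float response parsing below are illustrative assumptions, not the benchmark's exact setup.

```python
# Illustrative judge call using the google-genai SDK. The prompt wording and the
# plain-float response parsing are assumptions, not the benchmark's exact setup.
from google import genai
from google.genai import types

JUDGE_PROMPT = (
    "Compare the extracted markdown to the source document image. Check completeness, "
    "text accuracy, document structure, and whether any text was added that is not in "
    "the image. Reply with a single score from 0.0 to 1.0."
)

def judge(client: genai.Client, image_bytes: bytes, markdown: str) -> float:
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            JUDGE_PROMPT + "\n\nExtracted markdown:\n" + markdown,
        ],
    )
    return float(response.text.strip())  # a score of 0.70 or higher counts as a pass
```
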
Why not publish the source documents?
The suite is built from realistic workflow documents, so we publish the document categories and methodology rather than the source files themselves. That lets readers understand the coverage without turning private, copyrighted, or sensitive-looking examples into public benchmark fixtures.
Are these results a guarantee for my documents?
No. The benchmark shows how the current pipeline performed on our 41-file evaluation suite. It is useful evidence for expected behavior on similar inputs, but your own files may differ in scan quality, layout, language, handwriting, image compression, or document conventions.
Can the ranking change over time?
Yes. OCR models, prompts, quantization settings, and serving conditions change. We disclose the evaluated pipeline names and run details so results can be challenged, repeated, or updated when better data becomes available.