How to Evaluate Document Extraction APIs

The Demo Document Is Not the Evaluation

Every document extraction API looks good on the vendor’s sample invoice.

The problem starts when your documents arrive: a scanned supplier invoice with a faint stamp, a contract with an annex table, a delivery note photographed from a truck cab, a receipt in another language, a PDF where the text layer exists but does not match the visual reading order.

If your evaluation is “upload three clean PDFs and check whether JSON comes back,” you are not evaluating production behavior. You are evaluating the happy path.

A useful evaluation asks a different question: can this API support the workflow you are actually shipping?

That means testing accuracy, but not only accuracy. It also means testing schemas, confidence scores, source evidence, validation behavior, failure modes, cost shape, compliance fit, and what happens after extraction.

Start With the Workflow, Not the Vendor Matrix

Before comparing APIs, write down the workflow.

Document extraction is rarely the final product. The extracted fields usually feed another step:

  • Accounting writes invoice data to an ERP.
  • Operations generates a spreadsheet for review.
  • A compliance workflow creates a PDF checklist.
  • An AI agent converts a document to Markdown, extracts fields, and generates a summary.
  • An agency runs the same pattern across multiple client projects with different schemas.

Those downstream steps define what matters.

An API that is excellent at returning plain text may be the wrong choice if the workflow needs typed money fields, source citations, and per-field confidence. An API that extracts invoices well may be weak for contracts. A provider with strong single-operation accuracy may still be painful if you need to generate reports or spreadsheets from the result using another vendor.

Start the evaluation with a short workflow brief:

Workflow: supplier invoice intake to approval pack
Input: emailed PDF invoices, mixed digital and scanned
Output: approved JSON for ERP, XLSX tracker, PDF approval summary
Critical fields: supplier, invoice number, date, due date, VAT ID, line items, subtotal, tax, total, currency
Risk: wrong payment amount, duplicate invoice, missing VAT ID, EU customer data leaving approved processors
Review rule: low-confidence money fields route to human review before ERP update

That brief keeps the evaluation grounded. You are not buying generic extraction. You are buying a workflow boundary.

Build a Real Test Set

The test set decides whether the evaluation means anything.

Do not use only clean documents. Do not use only documents where you already know the API performs well. Do not let the vendor choose the files.

Build a set that reflects production traffic:

  • Clean digital PDFs
  • Scanned PDFs
  • Mobile photos of documents
  • Multi-page documents
  • Documents with tables split across pages
  • Documents with missing optional fields
  • Documents with unusual currencies, dates, or address formats
  • Documents from new suppliers or unfamiliar layouts
  • Documents that should be rejected or routed to review

For most teams, 50 to 100 real documents is enough to expose the important differences. If the workflow is high-risk or high-volume, use more. If you support multiple document types, evaluate each type separately instead of mixing everything into one score.

Separate the test set into categories before running the APIs:

  • Known-good documents: establish baseline happy-path behavior
  • Messy but valid documents: test scans, layout variation, and OCR resilience
  • Edge cases: test rejection, review routing, and error clarity
  • New formats: test generalization beyond templates
  • High-risk fields: test values where mistakes are expensive

If you cannot explain what each group tests, the evaluation will collapse into a vague accuracy impression.
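
One lightweight way to keep those categories explicit is a small manifest that travels with the test set. A minimal sketch in Python; the file names, categories, and notes are placeholders, not a required format:

```python
# test_set.py - a minimal manifest for an extraction evaluation set.
# File names, categories, and notes are illustrative placeholders.

TEST_SET = [
    {"file": "invoices/clean_001.pdf",   "category": "known_good",  "note": "digital PDF, standard layout"},
    {"file": "invoices/scan_014.pdf",    "category": "messy_valid", "note": "faint stamp, skewed scan"},
    {"file": "invoices/photo_007.jpg",   "category": "messy_valid", "note": "phone photo, glare"},
    {"file": "invoices/multidoc_02.pdf", "category": "edge_case",   "note": "two invoices in one file, should route to review"},
    {"file": "invoices/newvendor_05.pdf","category": "new_format",  "note": "first document from this supplier"},
]

def by_category(category: str) -> list[dict]:
    """Return the test documents that belong to one evaluation category."""
    return [doc for doc in TEST_SET if doc["category"] == category]
```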

Create Ground Truth Before Looking at Results

Ground truth is the human-approved answer for each document.

Create it before looking at API output. Otherwise the API result can subtly bias the reviewer. If the API says the total is 1,847.80, the reviewer is likely to accept that number rather than find the value independently and catch a discrepancy.

For each document, record:

  • Expected field values
  • Acceptable aliases or formatting variants
  • Required vs optional fields
  • Business validation rules
  • Whether a human should review the field
  • Source location when available

The source location matters because some extraction errors are not obvious from the value alone. A vendor name can be correct but come from the wrong party. A contract date can be real but refer to renewal instead of effective date. A total can be present but be the subtotal, not the payable amount.

For field evaluation, compare against the intended business meaning, not only the string.
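
A ground-truth record can stay simple as long as it captures values, variants, rules, and sources in one place. A minimal sketch, with illustrative field names and rules:

```python
# ground_truth.py - one human-approved record per test document,
# written down before any API output is inspected.
# Field names and validation rules are illustrative, not a required format.

GROUND_TRUTH = {
    "invoices/scan_014.pdf": {
        "fields": {
            "invoice_number": {"value": "INV-2024-0183", "required": True, "review": False},
            "total_amount":   {"value": "1847.80", "aliases": ["1,847.80"], "required": True, "review": True},
            "currency":       {"value": "EUR", "required": True, "review": False},
            "due_date":       {"value": "2024-07-15", "required": True, "review": True},
        },
        # Business rules the extracted values must satisfy, independent of string equality.
        "rules": [
            "total_amount is the payable amount, not the subtotal",
            "due_date refers to the payment due date, not the invoice date",
        ],
        # Where each value appears in the document, when known.
        "sources": {"total_amount": 'page 1, line "Total EUR 1,847.80"'},
    }
}
```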

Score Fields, Not Documents

Document-level accuracy hides the failures that matter.

An invoice with nine correct fields and one wrong payment amount is not “90% correct” in any useful sense. It is a payment risk. A contract extraction with correct parties and dates but a wrong termination clause may be unacceptable even if most fields are right.

Score each field separately:

  • invoice_number: exact match yes, business-valid yes, confidence 0.97, citation correct, action accept
  • total_amount: exact match no, business-valid no, confidence 0.83, citation correct, action review
  • currency: exact match yes, business-valid yes, confidence 0.96, citation correct, action accept
  • due_date: exact match no, business-valid ambiguous, confidence 0.71, citation partial, action review

This gives you a clearer picture than a single pass/fail result.

Track field-level metrics by importance:

  • Required field extraction rate
  • High-risk field correctness
  • Optional field usefulness
  • Table row completeness
  • False accept rate for fields that should have gone to review
  • False review rate for fields that could have been accepted automatically

The last two are where real operations cost appears. Reviewing too much wastes time. Accepting too much creates downstream risk.
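
A per-field scorer that also tracks false accepts and false reviews can be small. A sketch under the assumption of a single auto-accept threshold; real workflows usually need per-field thresholds:

```python
# score_fields.py - field-level scoring against ground truth.
# The threshold and dict shapes are illustrative assumptions.

AUTO_ACCEPT_THRESHOLD = 0.95  # assumed workflow setting, not a vendor default

def score_field(expected: dict, extracted: dict) -> dict:
    """Compare one extracted field against its ground-truth record."""
    value_ok = extracted.get("value") in [expected["value"], *expected.get("aliases", [])]
    auto_accepted = extracted.get("confidence", 0.0) >= AUTO_ACCEPT_THRESHOLD
    return {
        "correct": value_ok,
        "confidence": extracted.get("confidence", 0.0),
        # The two rates that drive operations cost:
        "false_accept": auto_accepted and not value_ok,    # wrong value slipped through
        "false_review": (not auto_accepted) and value_ok,  # correct value wasted reviewer time
    }

def summarize(results: list[dict]) -> dict:
    """Aggregate per-field results into the metrics worth tracking."""
    total = len(results) or 1
    return {
        "field_accuracy": sum(r["correct"] for r in results) / total,
        "false_accept_rate": sum(r["false_accept"] for r in results) / total,
        "false_review_rate": sum(r["false_review"] for r in results) / total,
    }
```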

Test the Schema Model

The schema model is the API’s contract with the rest of your workflow.

A weak schema model forces you to do cleanup after extraction. A strong one lets downstream code rely on typed fields.

Evaluate whether the API supports the shapes your workflow needs:

  • Primitive fields: text, number, boolean, date
  • Purpose-built fields: currency amount, currency code, address, country, IBAN, email
  • Arrays for line items, clauses, tables, transactions, or repeated parties
  • Nested fields inside arrays
  • Required fields and defaults
  • Calculated fields or validation rules
  • Enum values where only known options are allowed

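Schema syntax differs by provider, so treat the sketch below as a checklist of shapes to test for rather than any vendor's format. It covers typed fields, an enum, required fields, and a nested line-item array:

```python
# invoice_schema.py - the field shapes this workflow needs, expressed as a
# JSON-Schema-style dict. Vendor schema syntax will differ; this is only a
# checklist of shapes to exercise during the evaluation.

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["supplier", "invoice_number", "total_amount", "currency", "line_items"],
    "properties": {
        "supplier":       {"type": "string"},
        "invoice_number": {"type": "string"},
        "issue_date":     {"type": "string", "format": "date"},
        "due_date":       {"type": "string", "format": "date"},
        "currency":       {"type": "string", "enum": ["EUR", "USD", "GBP"]},  # only known options allowed
        "subtotal":       {"type": "number"},
        "tax":            {"type": "number"},
        "total_amount":   {"type": "number"},   # business rule to check separately: subtotal + tax == total_amount
        "line_items": {                          # array of nested objects
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "quantity", "unit_price", "amount"],
                "properties": {
                    "description": {"type": "string"},
                    "quantity":    {"type": "number"},
                    "unit_price":  {"type": "number"},
                    "amount":      {"type": "number"},
                },
            },
        },
    },
}
```
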
Then test schema failure behavior. A good API should reject an invalid schema before processing documents. It should not spend money on extraction and then return a vague error after the fact.

This matters for production. Schema errors are developer errors. Document errors are input errors. Your workflow needs to handle them differently.

Evaluate Confidence as a Routing Signal

Confidence scores are useful only if they help the workflow decide what to do next.

Do not just check whether the API returns confidence. Check whether confidence is calibrated enough for routing.

For each field, compare confidence against correctness:

  • 0.95-1.00: usually safe to auto-accept. Measure how often values in this band are wrong; wrong values here are the dangerous ones.
  • 0.80-0.95: often usable, but domain-dependent. A good candidate for per-field thresholds.
  • 0.60-0.80: the review zone. Should produce useful pre-filled values for the reviewer.
  • Below 0.60: manual entry or rejection. Should not create false certainty.

The worst failure is not low confidence. Low confidence is manageable. The worst failure is high confidence on a wrong value.

Run the evaluation by field type. Money fields should usually require higher confidence than names. Addresses may need review even when confidence is moderate because formatting and regional conventions vary. Contract clauses may need human review because the cost of misclassification is high.
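
Calibration is easy to check once fields are scored: bucket them by confidence band and look at the correct rate per band. A sketch, reusing per-field results like those from the scoring step; the bands are illustrative:

```python
# calibration.py - check whether confidence is usable as a routing signal.
# Bands are illustrative; pick bands that match your routing thresholds.

BANDS = [(0.95, 1.01), (0.80, 0.95), (0.60, 0.80), (0.0, 0.60)]

def calibration_report(fields: list[dict]) -> list[dict]:
    """fields: [{"confidence": float, "correct": bool}, ...] from the scoring step."""
    report = []
    for low, high in BANDS:
        in_band = [f for f in fields if low <= f["confidence"] < high]
        correct = sum(f["correct"] for f in in_band)
        report.append({
            "band": f"{low:.2f}-{min(high, 1.0):.2f}",
            "count": len(in_band),
            # If this rate is low in the top band, auto-accept is not safe yet.
            "correct_rate": correct / len(in_band) if in_band else None,
        })
    return report
```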

If you are building a review flow, read the confidence-score routing guide and the n8n low-confidence review guide. The evaluation should tell you where to set thresholds, not just which API has the prettiest JSON.

Require Source Evidence

Source citations are the difference between “the model says so” and “the workflow can explain itself.”

When an API extracts a value, it should show where that value came from. Depending on the provider, that might be a quote, a page reference, a bounding box, or a normalized source field.

Evaluate citations separately from values:

  • Does the citation support the extracted value?
  • Does it point to the correct occurrence when the value appears multiple times?
  • Does it preserve enough context for a reviewer?
  • Does it work for tables and repeated rows?
  • Does it survive scanned documents and OCR paths?
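
For money fields, a first-pass citation check can be automated before a human ever looks at the task. A sketch with naive normalization; real checks need locale-aware number and date parsing:

```python
# citation_check.py - a first-pass check that a citation actually supports the value.
# Normalization is deliberately naive and only suited to simple money fields.

import re

def normalize_amount(text: str) -> str:
    """Strip currency symbols and thousands separators so '1,847.80' matches '1847.80'."""
    return re.sub(r"[^\d.]", "", text)

def citation_supports_value(value: str, citation: str) -> bool:
    """True if the normalized value appears inside the normalized citation text."""
    normalized = normalize_amount(value)
    return normalized != "" and normalized in normalize_amount(citation)
```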

Citations matter most when humans review exceptions. A review task that says total_amount = 1847.80 is incomplete. A task that also shows Source: "Total EUR 1,847.80" is much easier to approve or correct.

They also matter for audits. If an auditor asks why a payment amount entered the system, you need more than a JSON value. You need a trace back to the source document.

Test Failure Modes Deliberately

Most evaluations accidentally avoid failures.

Production workflows cannot. Add documents that should fail or route to review:

  • Password-protected PDFs
  • Oversized files
  • Unsupported formats
  • Empty scans
  • Corrupted files
  • Ambiguous documents
  • Documents missing required fields
  • Tables with unreadable rows
  • Multi-document files where the workflow expects one document

Then inspect the API behavior:

  • Is the error clear enough to show an operator?
  • Does it distinguish invalid schema from invalid input?
  • Does it return partial results when appropriate?
  • Can the workflow retry safely?
  • Does async processing report failure consistently?
  • Are rate-limit errors structured and recoverable?

Failure behavior matters because successful documents are not where operators lose time. Operators lose time when a job gets stuck with a vague error and nobody knows whether to retry, reject, or review.
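
In code, the triage that matters is separating developer errors from input errors from transient errors. A sketch against a generic HTTP-style response; the status codes and error fields are assumptions, not any vendor's contract:

```python
# triage_errors.py - route extraction failures instead of treating them all the same.
# Status codes and response fields are generic assumptions, not a specific API's contract.

RETRYABLE = {429, 502, 503, 504}          # rate limits and transient upstream errors
DEVELOPER_ERRORS = {400, 401, 403, 422}   # invalid schema, bad credentials, malformed request

def triage(status_code: int, body: dict) -> str:
    """Return the queue a failed extraction job should land in."""
    if status_code in RETRYABLE:
        return "retry"                     # safe to retry with backoff
    if status_code in DEVELOPER_ERRORS:
        return "fix_integration"           # our bug: the schema or request is wrong
    if body.get("error", {}).get("type") == "unreadable_document":
        return "operator_review"           # input problem: show a clear message to an operator
    return "operator_review"               # default to a human rather than a silent retry loop
```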

The hidden PDF failure modes guide covers the document-side traps in more detail.

Check Workflow Fit After Extraction

Extraction is only one step.

If the workflow needs a spreadsheet, approval PDF, webhook, database write, or review branch, evaluate how much glue code sits after extraction.

Ask:

  • Can the extracted JSON feed the next step without reformatting?
  • Are field types stable enough for downstream code?
  • Does the provider support async processing and webhooks for larger files?
  • Are errors consistent with the other APIs you use?
  • Can you generate the final artifact with the same platform, or do you need another vendor?
  • Can usage and cost be tracked per workflow or client project?
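
One way to make glue code measurable is to write the smallest adapter from extraction output to the next artifact and count how much cleanup it needs. A sketch that appends approved fields to a CSV tracker; the column names follow the example schema above, not any vendor format:

```python
# tracker_row.py - the smallest possible adapter from extraction output to a tracker.
# If this adapter needs type casts, renames, or cleanup, count that as glue-code cost.

import csv

COLUMNS = ["supplier", "invoice_number", "due_date", "currency", "total_amount", "status"]

def append_tracker_row(extracted: dict, status: str, path: str = "invoice_tracker.csv") -> None:
    """Write one approved extraction result as a tracker row."""
    row = {col: extracted.get(col, "") for col in COLUMNS[:-1]}
    row["status"] = status  # e.g. "auto_accepted" or "reviewed"
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if f.tell() == 0:   # write the header only on the first append
            writer.writeheader()
        writer.writerow(row)
```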

This is where single-purpose extraction tools often look better in isolation than in the real workflow. They may extract fields well but leave you with another integration for PDF generation, sheet generation, review tasks, and billing reconciliation.

If the extracted output feeds a report, tracker, or generated document, include that downstream step in the evaluation. A slightly better extraction score can lose to a much simpler end-to-end workflow.

Evaluate Compliance and Data Handling

Document extraction often handles invoices, contracts, IDs, medical files, claims, resumes, or customer correspondence. The API sees sensitive data.

Do not leave compliance checks until procurement.

Ask each provider:

  • Where is processing performed?
  • Is file content stored after processing?
  • Are extracted values stored in logs?
  • How long are operational logs retained?
  • Is a Data Processing Agreement available?
  • Which sub-processors may see document data?
  • Can support staff access uploaded files or extracted content?
  • Are webhooks and failed requests logged with payloads?

For EU-facing workflows, region claims are not enough. The workflow may include upload storage, OCR, extraction, review, generation, webhooks, and logs. Each step needs to fit the data-flow model. The EU-hosted AI workflows guide is a useful checklist for that part of the evaluation.

Compare Cost by Workflow, Not by Page

Per-page pricing is easy to compare and easy to misread.

The cost that matters is the cost of the whole workflow:

  • Extraction cost
  • Review cost for low-confidence fields
  • Retry cost for failures
  • Document generation or spreadsheet generation cost
  • Vendor integration and maintenance cost
  • Cost of wrong auto-accepted fields
  • Cost of billing and usage attribution across clients or projects

For example, one provider may be cheaper per page but return weak confidence scores. If that doubles the review queue or increases payment errors, the cheap API is expensive. Another provider may cost more for extraction but reduce vendors by also generating the approval PDF and XLSX export.

Calculate cost against realistic monthly volume and realistic exception rates. Include peak months. Include document length distribution. Include multi-page invoices, not just one-page samples.
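
A back-of-the-envelope model makes the comparison concrete. Every number in the sketch below is a placeholder to swap for your own volumes and rates:

```python
# workflow_cost.py - compare providers on workflow cost, not per-page price.
# Every value here is a placeholder; substitute your own volumes and rates.

def monthly_workflow_cost(
    docs_per_month: int,
    pages_per_doc: float,
    price_per_page: float,
    review_rate: float,          # share of documents routed to human review
    review_cost_per_doc: float,
    failure_rate: float,         # share of documents that need a retry
    error_rate: float,           # share of auto-accepted documents with a wrong field
    cost_per_error: float,       # expected downstream cost of one wrong accepted field
) -> float:
    extraction = docs_per_month * pages_per_doc * price_per_page
    review = docs_per_month * review_rate * review_cost_per_doc
    retries = docs_per_month * failure_rate * pages_per_doc * price_per_page
    errors = docs_per_month * (1 - review_rate) * error_rate * cost_per_error
    return extraction + review + retries + errors

# Example: a cheaper per-page provider can still cost more overall
# if it doubles the review queue or raises the error rate.
cheap = monthly_workflow_cost(2000, 2.5, 0.01, review_rate=0.30, review_cost_per_doc=1.50,
                              failure_rate=0.05, error_rate=0.02, cost_per_error=25.0)
pricier = monthly_workflow_cost(2000, 2.5, 0.02, review_rate=0.12, review_cost_per_doc=1.50,
                                failure_rate=0.02, error_rate=0.005, cost_per_error=25.0)
```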

For agency workflows, include onboarding cost per client. A provider that requires separate vendor accounts, separate credentials, and separate billing per client can erase the savings from a lower unit price.

What Iteration Layer Optimizes For

Iteration Layer is built for document workflows where extraction has to feed the next step.

Document Extraction returns typed fields with confidence scores and source citations. Document to Markdown handles full-text conversion when the workflow needs readable context for RAG, search, summarization, or agents. Document Generation and Sheet Generation turn approved data into PDFs, DOCX files, XLSX exports, CSV files, or Markdown tables.

The evaluation angle is composability. If your workflow stops at JSON, a specialized extractor may be enough. If your workflow continues into review, reporting, spreadsheets, or generated client artifacts, consistent APIs matter: one auth model, one credit pool, one error style, and compatible outputs.

For concrete pipeline examples, see the extract invoices to spreadsheet recipe, the invoice-to-PDF report recipe, and the Document Extraction docs.

The Evaluation Checklist

Before choosing a document extraction API, answer these questions with evidence from your own test set:

  • Does the test set include clean documents, messy documents, edge cases, and rejection cases?
  • Is ground truth recorded before reviewing API output?
  • Are results scored per field, not only per document?
  • Does the schema model support your real fields, arrays, and validation needs?
  • Are confidence scores calibrated well enough for routing?
  • Do citations point to the right source evidence?
  • Are failure modes clear, structured, and recoverable?
  • Can the extracted output feed review, generation, spreadsheets, or downstream systems without fragile glue code?
  • Does the data handling model fit your compliance requirements?
  • Is cost calculated for the full workflow, not just per page?

If an API performs well on those questions, it is a serious candidate. If it only performs well on a vendor demo file, keep testing.
