Best Document Extraction APIs in 2026

“Document Extraction” Means Six Different Things

You search for “document extraction API” and get a page of results that all claim to solve your problem. Except they’re solving different problems. One gives you markdown. Another gives you key-value pairs. A third gives you bounding boxes. A fourth wants you to train a model before it extracts anything.

The term “document extraction” has been stretched to cover everything from basic OCR to full workflow automation platforms. That makes the landscape genuinely confusing for developers who need to pick a tool and ship something.

This guide breaks the market into categories, covers the major players in each, and helps you figure out which one actually fits your use case. No rankings. No “best overall” badge. Just what each tool does, where it breaks down, and what it costs.

The Six Categories

Before looking at individual tools, it helps to understand the categories. Each one solves a different part of the document processing problem, and picking a tool from the wrong category is the most common mistake teams make.

OCR engines convert documents to text or markdown. They handle the hard part of reading pixels and turning them into characters. But they stop there — the output is text, not structured data. You still need to parse that text into something your application can use.

Cloud document intelligence platforms offer pre-built models for common document types (invoices, receipts, IDs) plus the ability to train custom models. They’re backed by major cloud providers, deeply integrated with their ecosystems, and priced per page with volume tiers.

RAG and LLM preprocessing tools convert documents into chunks optimized for retrieval-augmented generation. They care about preserving document structure — tables, headers, sections — so that the resulting chunks produce better retrieval results. They’re not designed to extract specific fields.

Structured extraction APIs take a schema you define and return typed JSON. You describe the fields you want — their names, types, and descriptions — and the API extracts them from any document. No model training. No template configuration.

Enterprise IDP platforms wrap extraction in workflow automation — approval queues, human-in-the-loop review, ERP integrations, business rules. They target ops teams as much as developers.

VLM-based approaches skip purpose-built APIs entirely and use vision-language models (Gemini, GPT-5.4, Claude) to read documents directly. Maximum flexibility, minimum guardrails.

Each category serves a real need. The question is which need is yours.

OCR Engines

Mistral OCR

Mistral OCR is the price leader in the OCR space. You send a PDF or image, you get back markdown — with tables converted to markdown tables, headers to ##, and images preserved with references.

The latest version, Mistral OCR 3, ships a smaller and faster model tuned for forms, scanned documents, complex tables, and handwriting. Accuracy on clean, well-structured documents is strong. Pricing is aggressive: $2 per 1,000 pages, or $1 per 1,000 pages on the Batch API.

Best for: High-volume OCR where you need text or markdown output and you’re handling the structured extraction yourself — either with an LLM prompt, regex, or custom parsing logic.
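If you take this path, the parsing step is all yours. A minimal stdlib sketch of what "custom parsing logic" looks like for a table in OCR'd markdown — the invoice layout here is invented for illustration:

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first markdown table found into a list of row dicts."""
    lines = [l.strip() for l in markdown.splitlines() if l.strip().startswith("|")]
    if len(lines) < 3:  # need header, separator, and at least one data row
        return []
    cells_of = lambda line: [c.strip() for c in line.strip("|").split("|")]
    headers = cells_of(lines[0])
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = cells_of(line)
        if len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows

# Hypothetical OCR output for a one-table invoice.
ocr_output = """
# Invoice 2041

| Item     | Qty | Unit Price |
|----------|-----|------------|
| Widget A | 3   | $12.50     |
| Widget B | 1   | $99.00     |
"""
rows = parse_markdown_table(ocr_output)
```

Note that every cell is still a string — "$12.50" needs its own parsing before it can go into a database, and a document with a slightly different table layout needs different logic.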

Key limitations:

  • Output is markdown, not structured data — you still need to parse it
  • Accuracy degrades on complex multi-column layouts and nested tables
  • No typed fields, no confidence scores, no schema definition
  • Every document variation may need different parsing logic downstream

Pricing: $2/1,000 pages (standard), $1/1,000 pages (batch).

Tesseract

Tesseract is the long-running open-source OCR engine — developed at HP, later maintained by Google, and now community-maintained. It’s free, runs locally, and supports 100+ languages. For straightforward text extraction from scanned documents, it works.

But Tesseract is an OCR engine from the pre-LLM era. It doesn’t understand document layout beyond basic paragraph detection. Tables come back as jumbled text. Multi-column documents get merged. Handwriting accuracy is poor compared to modern alternatives.

Best for: On-premise deployments where data can’t leave your infrastructure, or budget-constrained projects with simple, consistent document layouts.

Key limitations:

  • No layout analysis — tables, columns, and headers are not preserved
  • Accuracy significantly worse than modern ML-based OCR
  • No API — you host and scale it yourself
  • No structured output of any kind

Pricing: Free (open-source). You pay for compute.

Cloud Document Intelligence Platforms

Azure AI Document Intelligence

Azure AI Document Intelligence — formerly Form Recognizer — organizes document processing around models. Pre-built models for invoices, receipts, ID documents, tax forms, contracts. Custom models you train on your own labeled data.

The pre-built models are good for standard document types. Azure’s invoice model extracts VendorName, InvoiceTotal, DueDate, and about two dozen other fields out of the box. But the moment your document doesn’t fit a pre-built category, you’re training a custom model — labeling samples in Azure’s Studio, running training jobs, managing model versions.

Best for: Teams already on Azure that process standard document types (invoices, receipts, IDs) and are willing to invest in custom model training for non-standard types.

Key limitations:

  • Custom models require labeled training data and ML ops workflow
  • Pre-built models have fixed field sets — you can’t add custom fields without training or add-ons
  • Query Fields add-on costs extra ($10/1,000 pages on top of the base model price)
  • Pricing requires understanding which model types, add-ons, and training hours you’ll use
  • No native support for XLSX, CSV, or JSON input

Pricing: $1.50/1,000 pages (Read/OCR), $10/1,000 pages (pre-built models), $30/1,000 pages (custom models). Add-ons like Query Fields and high resolution add $6–$10/1,000 pages on top.

AWS Textract

AWS Textract doesn’t have one document extraction API. It has five: DetectDocumentText, AnalyzeDocument, AnalyzeExpense, AnalyzeID, and AnalyzeLending. Each has its own endpoint, response format, and pricing tier.

AnalyzeDocument is the most flexible — it handles forms, tables, queries, and signatures — but each feature is billed separately. A single page analyzed for forms ($0.05), tables ($0.015), and queries ($0.015 per query) runs $0.08 or more before you’ve written any business logic.

The response format is a flat list of Block objects with ID-based relationships. Reconstructing a table means traversing a block graph. Every value is an untyped string. There’s no schema definition and no concept of field types.
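To make the traversal concrete, here's a sketch that joins KEY and VALUE blocks through their ID relationships. The sample blocks are hand-built and heavily simplified — real responses also carry geometry, confidence, and many more block types:

```python
def key_value_pairs(blocks: list[dict]) -> dict[str, str]:
    """Join KEY_VALUE_SET blocks into {key text: value text} via ID relationships."""
    by_id = {b["Id"]: b for b in blocks}

    def child_text(block: dict) -> str:
        # Concatenate the WORD blocks referenced by CHILD relationships.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]]
        return " ".join(words)

    pairs = {}
    for b in blocks:
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
            value_ids = [i for rel in b.get("Relationships", [])
                         if rel["Type"] == "VALUE" for i in rel["Ids"]]
            value = " ".join(child_text(by_id[i]) for i in value_ids)
            pairs[child_text(b)] = value
    return pairs

# Hand-built blocks mimicking a tiny slice of an AnalyzeDocument response.
blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Total"},
    {"Id": "w3", "BlockType": "WORD", "Text": "$4,250.00"},
]
pairs = key_value_pairs(blocks)
```

Even after the traversal, the value is the untyped string "$4,250.00" — type conversion is a separate step you own.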

Best for: Teams deeply invested in the AWS ecosystem that need document processing alongside S3, Lambda, and other AWS services.

Key limitations:

  • Five separate APIs with different response formats — you pick the right one per document type
  • All values returned as untyped strings — type parsing is your problem
  • No schema definition — you get everything Textract finds, then filter
  • Block-based response format requires significant post-processing
  • Queries billed per query per page — asking ten questions costs ten charges

Pricing: $0.0015/page (text detection), $0.015/page (tables), $0.05/page (forms), $0.015/query (queries). Features stack — a page with forms, tables, and queries costs $0.08+.

Google Document AI

Google Document AI is processor-based. You create a processor instance for a document type, send documents to it, and get results back. Pre-built processors cover OCR, forms, invoices, expenses. Custom processors require labeled training data — Google recommends at least 50 samples for decent accuracy.

The most frustrating constraint is the 10-page synchronous limit. Anything longer requires batch processing through Google Cloud Storage — upload to GCS, trigger processing, poll for completion, download results. A 15-page contract becomes a four-step async workflow.

Best for: Teams on GCP that process standard document types and can work within the processor model. Strong for high-volume OCR where GCS integration is already in place.

Key limitations:

  • Synchronous requests capped at 10 pages — longer documents need batch processing through GCS
  • Custom processors need 50+ labeled training samples
  • No support for DOCX, XLSX, CSV, or JSON input
  • Processor-based architecture means managing instances per document type
  • Not all processors available in all regions

Pricing: $1.50/1,000 pages (standard), dropping to $0.60/1,000 pages after 5M pages/month. Free trial credits available.

RAG and LLM Preprocessing Tools

LlamaParse

LlamaParse is LlamaIndex’s document parsing service. It converts PDFs, Word documents, slides, and spreadsheets into clean markdown, plain text, or structured JSON chunks optimized for RAG pipelines.

It offers four processing tiers — Fast, Cost Effective, Agentic, and Agentic Plus — each trading cost for quality. The Agentic tiers use vision models to interpret complex layouts, charts, and figures. For RAG, this matters: better parsing produces better chunks, which produce better retrieval results.

LlamaParse is a parsing tool, not an extraction tool. The output is document content in a consumable format — not “the invoice total is $4,250.” If you need specific fields in your database, you’re adding an LLM extraction step on top of LlamaParse’s output.

Best for: RAG pipelines where you need high-quality document chunks for a vector store. Especially useful with the LlamaIndex ecosystem.

Key limitations:

  • Output is markdown/text/chunks, not structured fields — extraction is a separate step
  • Tightly coupled with the LlamaIndex ecosystem
  • Credit-based pricing can be hard to predict at scale
  • No confidence scores, no typed fields, no schema definition
  • Extracting structured data requires chaining with a separate LLM call

Pricing: Credit-based. 1,000 credits cost $1–$1.50 depending on region. Per-page costs: ~3 credits (Cost Effective), ~10 credits (Agentic), ~90 credits (Agentic Plus). 10,000 free credits/month.

Unstructured.io

Unstructured.io started as an open-source library for document preprocessing and now offers a hosted platform. It handles parsing, chunking, and embedding — the full preprocessing pipeline for getting documents into a vector database or LLM context window.

The open-source core (unstructured) is genuinely useful. It parses PDFs, images, HTML, emails, and more into structured elements (titles, narrative text, tables, lists) with metadata. The hosted platform adds scale, monitoring, and managed infrastructure.

Like LlamaParse, Unstructured is a preprocessing tool. It cares about document structure for chunking quality, not for extracting specific business fields.

Best for: Teams building RAG pipelines that want an open-source core with optional managed infrastructure. Good for mixed-format document corpora.

Key limitations:

  • Designed for RAG preprocessing, not structured field extraction
  • Open-source version requires self-hosting and scaling
  • Hosted platform pricing is opaque — pay-as-you-go without published per-page rates
  • No schema definition, no typed fields, no confidence scores

Pricing: Open-source core is free. Hosted platform is pay-as-you-go with 15,000 free pages. Contact sales for volume pricing.

Structured Extraction APIs

Iteration Layer

Iteration Layer’s Document Extraction API takes a schema-first approach to extraction. You define the fields you want — their names, types, and descriptions — and the API extracts them from any document as typed JSON.

The schema system supports 17 purpose-built field types: TEXT, TEXTAREA, INTEGER, DECIMAL, BOOLEAN, DATE, DATETIME, TIME, EMAIL, IBAN, COUNTRY, CURRENCY_CODE, CURRENCY_AMOUNT, ADDRESS, ARRAY, ENUM, and CALCULATED. The types do real work during extraction. Define a field as currency_amount and you get a numeric value with proper decimal handling. Define address and the API decomposes it into street, city, region, postal code, and country. Define date and you get ISO 8601 regardless of whether the document says “February 27, 2026” or “27/02/2026.”
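As an illustration, a schema for a simple invoice might look like the following. The field names and payload shape here are invented for the example — consult the API docs for the exact request format:

```python
# Illustrative schema definition. Field names and structure are examples,
# not the exact wire format of the API.
invoice_schema = {
    "fields": [
        {"name": "invoiceNumber", "type": "TEXT",
         "description": "The invoice identifier, usually near the top"},
        {"name": "issueDate", "type": "DATE",
         "description": "Date the invoice was issued"},
        {"name": "total", "type": "CURRENCY_AMOUNT",
         "description": "Grand total including tax"},
        {"name": "vendorAddress", "type": "ADDRESS",
         "description": "The vendor's postal address"},
        {"name": "lineItems", "type": "ARRAY",
         "description": "One entry per billed line item"},
    ]
}
```

The descriptions do double duty: they tell the model where to look, and they document the schema for whoever maintains it next.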

CALCULATED fields reference other extracted fields. Define totalCheck as unitPrice * quantity and the API computes it during extraction — built-in validation without post-processing code.

Every extracted value includes a confidence score (0 to 1) and a source citation — the verbatim text the model read to produce the value. Confidence scores let you route low-confidence extractions to human review. Source citations let reviewers verify without opening the original document.
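Routing on confidence then becomes a few lines of code. A sketch, assuming a result shape of value plus confidence plus citation (illustrative, not the exact response format):

```python
def route(extraction: dict, threshold: float = 0.85) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and needs-review buckets."""
    accepted, review = {}, {}
    for name, field in extraction.items():
        bucket = accepted if field["confidence"] >= threshold else review
        bucket[name] = field
    return accepted, review

# Illustrative result shape: value + confidence + verbatim source citation.
extraction = {
    "total": {"value": 4250.00, "confidence": 0.97,
              "source": "Total due: $4,250.00"},
    "vendorAddress": {"value": {"city": "Berlin"}, "confidence": 0.61,
                      "source": "Musterstr. 1, Berlin"},
}
accepted, review = route(extraction)
```

The reviewer only sees the low-confidence bucket, with the verbatim citation alongside each value, so most documents never touch a human.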

The API handles PDFs, images, DOCX, XLSX, CSV, JSON, HTML, Markdown, and plain text natively. Structured formats like Excel and CSV aren’t OCR’d — they’re parsed as structured data, avoiding the accuracy loss that OCR introduces on already-digital files.

Multi-file extraction accepts up to 20 files per request, combining them into a single extraction context. A loan application spanning a bank statement, pay stub, and tax return is one API call, one schema, one response.

Best for: Developers who need typed, structured JSON from any document type — without training models, managing processors, or writing parsing logic.

Key limitations:

  • Newer entrant — smaller community and ecosystem compared to cloud providers
  • No on-premise deployment option
  • Not designed for RAG preprocessing or full-text search use cases

Pricing: See pricing page.

Reducto

Reducto sits in the same structured extraction category as Iteration Layer. Their Extract API takes a JSON Schema definition and returns structured data from documents. They’ve raised $108M total, including a $75M Series B led by Andreessen Horowitz.

Reducto uses JSON Schema for field definitions — familiar and flexible, but limited to generic primitives (string, number, boolean, array, object). A string is a string whether it’s an invoice number, an IBAN, or a street address. Type-specific parsing, normalization, and validation happen in your code.
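For illustration, here's a generic invoice schema in JSON Schema alongside the kind of normalization code you end up owning. The schema is an example, not Reducto's required shape, and the parsing is deliberately minimal:

```python
from datetime import datetime
from decimal import Decimal

# Generic JSON Schema: every field is a primitive, so every value needs
# downstream normalization.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "description": "Date the invoice was issued"},
        "total": {"type": "string", "description": "Grand total with currency symbol"},
    },
    "required": ["invoice_number", "total"],
}

def normalize_date(raw: str):
    """A string is just a string — date parsing is on you."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_amount(raw: str) -> Decimal:
    """Strip currency symbols and thousands separators (US formatting only)."""
    return Decimal(raw.replace("$", "").replace(",", "").strip())

issued = normalize_date("February 27, 2026")
total = normalize_amount("$4,250.00")
```

Each format variant you encounter in production adds another branch to these functions — that's the cost of generic types.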

Their smart model routing picks the right underlying model based on document complexity, which can optimize cost and accuracy. They also offer document splitting and an editing API alongside extraction.

Best for: Teams that want structured extraction with JSON Schema definitions and are comfortable handling type normalization in their own code.

Key limitations:

  • JSON Schema uses generic types — no purpose-built field types like CURRENCY_AMOUNT or ADDRESS
  • No source citations as verbatim text (provides bounding boxes instead)
  • No CALCULATED fields for built-in validation
  • Credit-based pricing can be unpredictable

Pricing: Credit-based, starting at $0.015+/page. Free credits on sign-up. Growth and Enterprise plans available.

Sensible

Sensible is an API-first extraction platform that bills per document rather than per page. This is a meaningful differentiator — a 100-page mortgage application costs the same as a 1-page receipt.

Their extraction approach uses a combination of layout-based rules and LLM-based queries. You configure extractions per document type using their web editor, defining fields with methods like label, row, table, and LLM-based query for complex cases.

Best for: Teams processing long, variable-length documents (mortgage packets, insurance applications) where per-page pricing would be expensive.

Key limitations:

  • Configuration-based extraction requires per-document-type setup in their editor
  • Less flexible than pure schema-based approaches — each document type needs its own configuration
  • Smaller ecosystem and fewer integrations than cloud providers

Pricing: Free tier with 100 extractions/month. Paid plans from $499/month including 750 extractions. Per-document billing regardless of page count.

Enterprise IDP Platforms

Nanonets

Nanonets is an intelligent document processing platform that targets ops teams as much as developers. It combines OCR, model-based extraction, and workflow automation — approval queues, validation rules, ERP integrations, and human-in-the-loop review.

You train a model per document type by uploading and labeling samples, similar to Azure and Google’s custom model approach. The training interface is friendlier than the cloud providers’, but the fundamental constraint is the same: new document types require new training.

Their block-based pricing changed in January 2025. Each workflow step (“block”) has its own cost — extraction, formatting, lookups, and premium integrations are all billed separately.

Best for: Operations teams automating AP, HR, or procurement workflows that need human review steps and ERP integrations alongside extraction.

Key limitations:

  • Requires model training per document type
  • Block-based pricing makes cost estimation complex (~$0.30/page for extraction, plus blocks)
  • Focused on workflow automation — overkill if you just need an API
  • Less developer-focused than API-first alternatives

Pricing: Free tier available. Pro plan from $499/month. Block-based usage charges on top. $200 in free credits on sign-up.

Brief Mentions

Several other enterprise IDP platforms serve specific niches:

  • Docsumo — focuses on financial document automation (invoices, bank statements, insurance claims). Pre-trained models for common financial documents. Good for AP automation teams.
  • Rossum — positions itself as an AI-powered document gateway. Strong in Europe, especially for invoice processing and supply chain documents. Human-in-the-loop review is a core feature, not an add-on.
  • Affinda — offers resume parsing, invoice extraction, and document classification. Their resume parser is one of the better ones on the market. Less general-purpose than the other tools here.
  • Mindee — API-first approach to document parsing with pre-built models for receipts, invoices, passports, and more. Developer-friendly SDKs. Custom model training available.
  • Extend — targets insurance and financial services with workflow automation around document extraction. Deep vertical focus.
  • LandingAI — from Andrew Ng’s team, offers agentic document extraction APIs. Vision-model-based approach with per-page pricing.
  • Base64.ai — multi-model approach that processes 700+ document types. Claims sub-second processing times. Enterprise-focused with compliance certifications.

These platforms share a common trait: they’re designed for specific verticals or enterprise workflows, not general-purpose developer APIs. If your use case aligns with their vertical, they can save significant integration work. If it doesn’t, you’ll fight their assumptions about how documents should be processed.

VLM-Based Approaches

You don’t need a document extraction API at all. You can send a document directly to a vision-language model and ask it to extract what you need. This is increasingly viable in 2026.

Gemini

Gemini — specifically Gemini 3 Pro — currently leads most document understanding benchmarks. Its multimodal architecture processes document images with strong accuracy on tables, handwriting, and complex layouts. With a 1M+ token context window, it can handle extremely long documents in a single call.

You write a prompt describing the fields you want, attach the document, and parse the response. If you add structured output (JSON mode), you get something close to what a purpose-built extraction API provides.
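The pattern looks something like this. The model call itself is elided — `model_response` stands in for whatever the API returns — and the field list is invented for the example:

```python
import json

FIELDS = {"invoice_number": str, "total": float, "due_date": str}

def build_prompt(fields: dict) -> str:
    names = ", ".join(f"{k} ({t.__name__})" for k, t in fields.items())
    return (f"Extract the following fields from the attached document: {names}. "
            "Respond with a single JSON object and nothing else.")

def parse_response(raw: str, fields: dict) -> dict:
    """Enforce the schema yourself — the model won't."""
    data = json.loads(raw)
    missing = set(fields) - set(data)
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    for name, typ in fields.items():
        if not isinstance(data[name], typ):
            raise TypeError(f"{name} is {type(data[name]).__name__}, "
                            f"expected {typ.__name__}")
    return data

prompt = build_prompt(FIELDS)
# Stand-in for a real model response.
model_response = '{"invoice_number": "INV-2041", "total": 4250.0, "due_date": "2026-03-15"}'
extracted = parse_response(model_response, FIELDS)
```

Everything `parse_response` does here — presence checks, type checks, error handling — is what a purpose-built extraction API would do for you.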

Best for: One-off extractions, prototyping, or use cases where the document types are too varied for any pre-built solution.

Key limitations:

  • No typed field system — you enforce types in your prompt or your code
  • No confidence scores — the model either returns a value or doesn’t
  • No source citations unless you ask for them in the prompt (and trust the model to quote accurately)
  • Output format depends on prompt engineering — inconsistencies across documents
  • Token-based pricing makes cost unpredictable for high-volume use
  • Rate limits and latency aren’t designed for production document pipelines

GPT-5.4

GPT-5.4 handles document images well, though it trails Gemini on complex layouts and dense tables in most benchmarks. OpenAI’s structured outputs feature (JSON mode with a schema) helps with consistency, but you’re still writing prompts to locate and extract fields.

Same tradeoffs as Gemini — maximum flexibility, no guardrails. Good for prototyping. Expensive and inconsistent at scale.

Claude

Claude — specifically Claude Opus 4.6 and Sonnet 4.6 — supports native PDF input. You don’t need to convert the PDF to images first; you send the PDF directly. This preserves text fidelity and avoids OCR errors on digitally-created PDFs.

Claude’s document understanding is strong, especially for contracts, legal documents, and long-form text. Native PDF support means faster processing and higher accuracy on text-heavy documents compared to the screenshot-then-OCR approach.
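A sketch of the request body for native PDF input, following Anthropic's documented Messages API content-block shape. The model name is a placeholder, and the exact format should be verified against the current docs:

```python
import base64

def pdf_message(pdf_bytes: bytes, instruction: str) -> dict:
    """Build a Messages API request body with a native PDF document block."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder — use a current model name
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": base64.standard_b64encode(pdf_bytes).decode()}},
                {"type": "text", "text": instruction},
            ],
        }],
    }

body = pdf_message(b"%PDF-1.4 ...",
                   "Extract the contract parties and effective date as JSON.")
```

The PDF goes in as-is, base64-encoded — no image conversion step, which is where OCR errors on digital PDFs usually creep in.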

Same fundamental limitations as other VLMs: no built-in typed fields, no confidence scores, no source citations, and prompt-dependent output consistency.

The VLM Tradeoff

Using a VLM directly gives you the most flexibility and the least structure. You can extract anything from any document — if you write the right prompt. But “if” is doing a lot of work in that sentence.

For prototyping and low-volume extraction, VLMs are the fastest way to get started. For production pipelines processing thousands of documents, you’ll end up building schema enforcement, type validation, confidence scoring, and citation tracking on top of the VLM. At that point, you’ve built a document extraction API — just a worse one, because you’re maintaining it yourself.

Comparison Table

| Tool | Approach | Output Format | Typed Fields | Confidence Scores | Multi-Format Input | Pricing Model |
|---|---|---|---|---|---|---|
| Mistral OCR | OCR | Markdown | No | No | PDF, images | Per page |
| Azure Document Intelligence | Pre-built + custom models | JSON | Partial (per model) | Yes | PDF, images, TIFF | Per page + add-ons |
| AWS Textract | Multiple APIs | Block JSON | No | Yes (per block) | PDF, images | Per page per feature |
| Google Document AI | Processor-based | JSON | Partial (per processor) | Yes | PDF, images, TIFF | Per page |
| LlamaParse | RAG preprocessing | Markdown / text / chunks | No | No | PDF, DOCX, PPTX, XLSX | Credit-based |
| Unstructured.io | RAG preprocessing | Structured elements | No | No | 25+ formats | Pay-as-you-go |
| Iteration Layer | Schema-based extraction | Typed JSON | Yes (17 types) | Yes | PDF, images, DOCX, XLSX, CSV, JSON, HTML | See pricing page |
| Reducto | Schema-based extraction | JSON | No (generic types) | Partial | PDF, images, spreadsheets | Credit-based |
| Sensible | Config-based extraction | JSON | Partial | Yes | PDF, images | Per document |
| Nanonets | Model-based + workflow | JSON | Partial | Yes | PDF, images | Block-based |
| Gemini | VLM | Prompt-dependent | No | No | PDF, images | Per token |
| GPT-5.4 | VLM | Prompt-dependent | No | No | Images | Per token |
| Claude | VLM | Prompt-dependent | No | No | PDF (native), images | Per token |

How to Choose

The right tool depends on what you’re building. Here’s a decision tree.

“I need raw text or markdown for a RAG pipeline.” Use LlamaParse or Unstructured.io. They’re purpose-built for this. LlamaParse if you’re in the LlamaIndex ecosystem. Unstructured if you want an open-source core or handle mixed-format document corpora.

“I need cheap OCR at scale and I’ll handle the parsing.” Use Mistral OCR. At $1–$2 per 1,000 pages, nothing else comes close on price. Budget engineering time for the post-processing pipeline you’ll need to turn that markdown into usable data.

“I’m already on AWS / Azure / GCP.” Use the document processing tool from your cloud provider. Textract for AWS, Azure Document Intelligence for Azure, Google Document AI for GCP. The ecosystem integration — IAM, storage, monitoring, billing — is worth the tradeoffs. Just understand that you’re committing to their abstractions (blocks, models, processors).

“I need structured fields from documents without training models.” Use Iteration Layer or Reducto. Both take a schema and return structured JSON. Iteration Layer gives you 17 typed fields, confidence scores, source citations, and multi-file extraction. Reducto uses JSON Schema with generic types and offers smart model routing. Try both — the right choice depends on how much type handling you want in the API versus your code.

“I need workflow automation for AP, HR, or procurement.” Use Nanonets or Docsumo. They bundle extraction with approval queues, validation rules, and ERP integrations. If your bottleneck is the workflow around extraction — not the extraction itself — an IDP platform saves real time.

“I want maximum flexibility and I’ll figure out the rest.” Use a VLM directly — Gemini for highest accuracy, Claude for native PDF support, GPT-5.4 for the OpenAI ecosystem. You get unlimited flexibility and zero structure. Budget for the schema enforcement, validation, and error handling code you’ll write.

“I process long documents and per-page pricing is killing me.” Look at Sensible. Per-document pricing means a 100-page mortgage packet costs the same as a single-page form. If document length varies widely and your volumes are moderate, the economics work in your favor.

What Actually Matters

The document extraction market is fragmented because the problem itself is fragmented. A team building a RAG chatbot needs fundamentally different tooling than a team automating invoice processing. A startup parsing 100 documents a month has different constraints than an enterprise processing 100,000.

Three things separate the tools that work in production from the ones that work in demos:

Type safety. Getting back a string that says “$4,250.00” is not the same as getting back a number with currency metadata. Every untyped value is a parsing bug waiting to happen in a locale you haven’t tested.
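A deliberately simplistic sketch shows the failure mode: the same parsing code that handles a US-formatted amount correctly returns a number 1,000× too small for the European formatting of the same value:

```python
from decimal import Decimal

def naive_parse(raw: str) -> Decimal:
    """Works for US-formatted amounts; silently wrong for other locales."""
    return Decimal(raw.replace("$", "").replace("€", "").replace(",", "").strip())

us = naive_parse("$4,250.00")   # 4250.00 — correct
eu = naive_parse("€4.250,00")   # 4.25000 — same amount, off by 1,000×
```

No exception is raised in either case, which is exactly why this class of bug survives until a customer in the wrong locale finds it.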

Confidence signals. Knowing that the extraction is 95% confident in the invoice total but only 60% confident in the vendor address lets you automate the high-confidence cases and route the rest to human review. Without confidence scores, everything goes to review or nothing does.

Format coverage. Documents arrive as PDFs, scanned images, Word files, Excel exports, and email attachments. A tool that only handles PDFs and images forces you to build conversion pipelines for everything else.

Pick the tool that matches your problem. If you’re building structured extraction pipelines and want typed fields, confidence scores, and multi-format support without training models, start with the docs.
