Document Extraction vs Reducto: Two Approaches to Structured Extraction


Most “Competitors” Aren’t Competing

Most document processing APIs stop at OCR. They take a PDF, run it through optical character recognition, and hand you back markdown or plain text. What you do with that text — how you find the invoice number, parse the line items, validate the totals — that’s your problem.

Reducto is different. Their Extract API does what most OCR tools don’t: it takes a schema, reads the document, and returns structured JSON with the fields you asked for. That puts them in the same category as the Iteration Layer Document Extraction API. Not the OCR-and-good-luck category. The actual structured extraction category.

This post is a direct comparison of the two. Reducto is a well-funded company ($108M total, including a $75M Series B led by Andreessen Horowitz) with good accuracy and a solid engineering team. They deserve the comparison, not a dismissal.

But the two products make different design decisions. Those decisions matter when you’re choosing which one to build on.

Schema Definition: JSON Schema vs. Typed Fields

Reducto’s Extract API uses JSON Schema to define what you want to extract. You specify property names, types (string, number, boolean, array, object), and descriptions. The LLM uses those descriptions to locate values in the document.

This works. JSON Schema is familiar, flexible, and well-documented. But it also means you’re working with generic primitives. A string is a string — whether it holds an invoice number, an IBAN, or a street address.
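To make that concrete, here is the kind of JSON Schema you might pass to an extract call. The field names and descriptions are illustrative, not taken from Reducto's docs:

```typescript
// Illustrative JSON Schema for an invoice extraction request.
// Field names and descriptions are hypothetical, not from Reducto's docs.
const invoiceSchema = {
  type: "object",
  properties: {
    invoiceNumber: {
      type: "string",
      description: "The unique invoice identifier, usually near the top of the page",
    },
    totalAmount: {
      type: "string",
      description: "The grand total, exactly as printed, including currency symbol",
    },
    lineItems: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string", description: "Line item description" },
          quantity: { type: "number", description: "Units ordered" },
          unitPrice: { type: "string", description: "Price per unit as printed" },
        },
      },
    },
  },
};
```

Note that every money field comes back as a generic string or number. Turning "$1,234.56" into a usable amount is still your job.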

Iteration Layer takes a different approach: 17 purpose-built field types — TEXT, TEXTAREA, INTEGER, DECIMAL, BOOLEAN, DATE, DATETIME, TIME, EMAIL, IBAN, COUNTRY, CURRENCY_CODE, CURRENCY_AMOUNT, ADDRESS, ARRAY, ENUM, and CALCULATED.

The difference shows up in what you get back. Define a field as CURRENCY_AMOUNT and you get a numeric value with proper decimal handling — no parsing “$1,234.56” from a string yourself. Define ADDRESS and the API auto-decomposes it into street, city, region, postal code, and country. You don’t write that normalization logic. It’s built into the type system.

CALCULATED fields are where this gets interesting. Define a field that references other extracted fields — say, unitPrice * quantity — and the API computes it during extraction. You can use this for validation (does the computed total match the stated total?) or to derive values the document doesn’t explicitly contain.

With JSON Schema, all of that post-processing lives in your code. With typed fields, it lives in the extraction step.
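As a sketch of what that post-processing looks like in practice — hypothetical helpers, not part of either API — here is the code a generic-string schema leaves in your lap:

```typescript
// Post-processing you write yourself when every field is a generic string.
// These helpers are illustrative; typed fields (CURRENCY_AMOUNT, CALCULATED)
// move this work into the extraction step instead.

// Parse a printed amount like "$1,234.56" into a number.
function parseCurrency(raw: string): number {
  const cleaned = raw.replace(/[^0-9.\-]/g, ""); // strip symbol and thousands separators
  const value = Number(cleaned);
  if (Number.isNaN(value)) throw new Error(`Unparseable amount: ${raw}`);
  return value;
}

// Validate that line items add up to the stated total — the kind of check
// a CALCULATED field (unitPrice * quantity) could perform during extraction.
function totalsMatch(
  items: { quantity: number; unitPrice: string }[],
  statedTotal: string,
  toleranceCents = 1
): boolean {
  const computed = items.reduce(
    (sum, item) => sum + item.quantity * parseCurrency(item.unitPrice),
    0
  );
  return Math.abs(computed - parseCurrency(statedTotal)) * 100 <= toleranceCents;
}
```

None of this is hard to write, but every field type you normalize by hand is another place for a parsing bug to hide.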

File Format Support: OCR Everything vs. Parse Natively

Reducto processes PDFs, images, and spreadsheets. Their Parse step handles OCR, layout detection, and table parsing before the Extract step pulls out your schema fields. They support 30+ formats according to their docs.

Iteration Layer also handles PDFs and images, but adds native parsing for DOCX, XLSX, CSV, JSON, HTML, Markdown, and plain text. “Native” is the key word. An Excel file isn’t OCR’d — it’s parsed as structured data. A JSON file isn’t treated as an image of text — it’s read as JSON.

This matters because OCR introduces errors. Even good OCR occasionally misreads characters, especially in tables with dense numeric data. If the source file is already structured — a spreadsheet, a CSV export, a JSON payload — there’s no reason to OCR it. Parsing it natively is faster and more accurate.
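A toy illustration of the difference: a CSV export already carries its structure, so reading it is deterministic, while the same data rendered to pixels and OCR'd can only be reconstructed probabilistically. (Naive split shown; a real parser also handles quoted fields.)

```typescript
// A CSV export already carries its structure; reading it is deterministic.
// (Naive split for illustration — a real parser also handles quoted fields.)
const csv = "sku,quantity,unit_price\nA-100,12,19.99\nB-205,3,4.50";

const [header, ...rows] = csv.split("\n");
const columns = header.split(",");

const records = rows.map((row) => {
  const cells = row.split(",");
  return Object.fromEntries(columns.map((col, i) => [col, cells[i]]));
});
// records[0] is { sku: "A-100", quantity: "12", unit_price: "19.99" } —
// no character-recognition step, so no misread "1" vs "l" or "0" vs "O".
```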

If your pipeline only handles scanned PDFs and photos, this distinction doesn’t matter much. If you process a mix of digital documents, spreadsheets, and structured files alongside scans, native parsing avoids a whole class of errors.

Source Citations: Where Did This Value Come From?

Both products offer some form of provenance for extracted values. Reducto provides bounding boxes and context references when citations are enabled. Iteration Layer returns source citations — verbatim text from the document that the extracted value came from.

The difference is in how you use them. Bounding boxes tell you where on the page a value was found. Source citations tell you what text the model read to produce the value. For audit trails and human review, the verbatim source text is often more useful than pixel coordinates. A reviewer can glance at the citation and confirm the extraction without opening the original document.

Both approaches have their place. If you need to highlight regions in a document viewer, bounding boxes are better. If you need a human to verify an extraction in a review queue, verbatim citations are faster to check.
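A review queue can even machine-check a verbatim citation before a human sees it: the citation should appear in the document text, and the extracted value should be consistent with its own citation. A hypothetical helper — the response shapes here are illustrative, not either API's actual format:

```typescript
// Pre-screen an extraction for human review using its verbatim citation.
// Hypothetical shapes — not either API's actual response format.
interface ExtractedField {
  name: string;
  value: string;
  citation: string; // verbatim text the model read
}

// Flag for review when the citation can't be found in the document text,
// or the value doesn't appear inside its own citation.
function needsReview(field: ExtractedField, documentText: string): boolean {
  return (
    !documentText.includes(field.citation) ||
    !field.citation.includes(field.value)
  );
}
```

Doing the same check with bounding boxes requires re-rendering the page and cropping pixels, which is why verbatim citations tend to be cheaper to verify automatically.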

Multi-File Extraction

Iteration Layer accepts up to 20 files per extraction request. The files are combined and treated as a single document for extraction purposes.

This matters for real-world document workflows. A supplier sends their catalog as three separate PDFs. An insurance claim comes with a form, a photo of the damage, and a receipt. A loan application spans a bank statement, a pay stub, and a tax return. Instead of extracting from each file separately and merging the results in your code, you send them all in one request and get one unified response.

import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.extract({
  files: [
    { url: "https://example.com/catalog-page-1.pdf" },
    { url: "https://example.com/catalog-page-2.pdf" },
  ],
  schema: {
    fields: [
      { name: "products", type: "array", fields: [
        { name: "name", type: "text" },
        { name: "sku", type: "text" },
        { name: "price", type: "currency_amount" },
        { name: "category", type: "enum", values: ["Electronics", "Clothing", "Home", "Sports"] },
      ]},
    ],
  },
});

Two files, one schema, one response. The array of products spans both catalog pages. No merge logic on your side.

Reducto’s Extract API processes one document per request. Multi-document workflows require multiple API calls and client-side merging. Their Split API handles a related but different problem — splitting a single PDF that contains multiple logical documents (like a stack of invoices in one file).

MCP Integration

Every Iteration Layer API ships as an MCP server. If you use Claude, Cursor, or any MCP-compatible client, the Document Extraction API shows up as a tool your agent can call directly.

This means you can say “extract the line items from this invoice” in your IDE, and the agent discovers the API, constructs the schema, makes the call, and returns the structured data. No integration code, no SDK setup — the agent handles it.
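Registering an MCP server in a client like Claude Desktop is typically a few lines of config. The package name and environment variable below are placeholders, not Iteration Layer's published values:

```json
{
  "mcpServers": {
    "iteration-layer": {
      "command": "npx",
      "args": ["-y", "@iterationlayer/mcp-server"],
      "env": { "ITERATION_LAYER_API_KEY": "YOUR_API_KEY" }
    }
  }
}
```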

Reducto provides SDKs for Python and TypeScript, plus a REST API. Standard integration paths. But no MCP support, which means no native agent integration without writing the glue code yourself.

Whether MCP matters to you depends on your workflow. If you’re building traditional backend pipelines, SDKs are fine. If you’re building agent-powered workflows or want to use extraction as a tool in an AI coding environment, MCP is a meaningful advantage.

Where Reducto Has the Edge

Being fair: Reducto does things Iteration Layer doesn’t.

  • Smart model routing. Reducto picks the optimal model for each region of a document — one model for tables, another for handwritten text, another for printed paragraphs. This can improve accuracy on complex documents with mixed content types.
  • Document splitting. Their Split API segments a multi-document PDF into individual documents. If a scanner produces a single PDF containing 50 invoices, Split identifies the boundaries and gives you 50 separate documents. Iteration Layer doesn’t have this.
  • Edit API. Reducto can fill PDF forms and modify DOCX files. That’s a different product category, but it’s useful if you need round-trip document processing.
  • Enterprise compliance. SOC 2 Type I/II, HIPAA support with BAA, zero data retention options. If you’re in healthcare or financial services with strict compliance requirements, Reducto has the certifications.

These aren’t small things. Smart model routing genuinely helps with complex documents. Document splitting solves a real problem in scanning-heavy workflows. If those capabilities are central to your use case, Reducto deserves serious consideration.

When to Choose What

Pick Reducto if:

  • You primarily process scanned PDFs and images
  • You need document splitting for multi-document PDF files
  • SOC 2 / HIPAA compliance is a hard requirement today
  • Smart model routing for mixed-content documents is important to your accuracy

Pick Iteration Layer if:

  • You process a mix of PDFs, spreadsheets, DOCX, CSV, JSON, and other digital formats
  • You want typed field extraction (ADDRESS decomposition, CALCULATED fields, IBAN validation) without post-processing
  • You need multi-file extraction — combining several documents into one request
  • You’re building agent-powered workflows and want native MCP integration
  • Source citations matter for your audit trail or human review process

Get Started

Check the docs for the full schema reference, field type definitions, and SDK guides. The TypeScript and Python SDKs handle authentication and request construction, so integration is a few lines of code.

And because Document Extraction is part of a composable API suite, the structured data it returns flows directly into Document Generation or Image Generation — same auth, same credit pool, no glue code.

Iteration Layer runs on EU infrastructure (Frankfurt), which matters if your data residency requirements rule out US-hosted services.

Sign up for a free account — no credit card required. Try your actual documents against a schema and see what comes back.
