The Complete Guide to Document Parsing in 2026


What Document Parsing Actually Is

Document parsing is the process of extracting structured data from unstructured documents. You have a PDF, a scanned image, a Word file. Somewhere inside it there is an invoice number, a total amount, a date, a list of line items. Your system needs those values as clean, typed fields — not as pixels on a page.

This sounds simple until you realize that documents are not databases. They have no fixed schema. Two invoices from two vendors have the same logical fields — invoice number, total, due date — but completely different layouts. A receipt from a coffee shop looks nothing like a receipt from a hardware store. A medical form from one hospital has different sections than one from another.

Document parsing bridges this gap. It takes a visual, human-readable document and produces machine-readable data. How you do it — and how well it works — depends entirely on the approach.

Why It Matters

Every business process that touches paper or PDFs needs document parsing. Accounts payable processes invoices. HR onboarding parses IDs and tax forms. Insurance processes claims. Logistics reads shipping manifests. Real estate deals involve stacks of contracts, appraisals, and disclosures.

Manual data entry is the default. Someone opens the document, reads the values, types them into a system. This is slow, expensive, and error-prone. A data entry operator makes mistakes. They get tired. They skip fields. They transpose digits. At scale, the error rate compounds.

Automation is the goal. But the specific method matters. A bad automation system is worse than manual entry — it introduces errors silently, and nobody catches them until a customer complains or an audit fails.

The Evolution of Document Parsing

Document parsing has gone through distinct phases, each solving one problem while creating new ones.

Manual data entry was the starting point. Humans read documents and type values into forms. Accurate when done carefully, but slow and unscalable. Still the default in many organizations.

Regex and string matching came next. Developers write patterns to match specific text in extracted document text. Fast, deterministic, and brittle. Works perfectly until the document format changes.

Template-based extraction added spatial awareness. Define bounding boxes or anchor points for each document layout. If the invoice number is always at position (x: 450, y: 120), draw a box there and extract what’s inside. Precise for known templates, useless for unknown ones.

AI-based extraction is the current generation. Use machine learning to understand document content semantically. Define what data you want — the model finds it regardless of layout, language, or format. More flexible, but requires proper schema design and confidence handling.

Each approach is still used today. The right choice depends on your documents, your scale, and your tolerance for maintenance.

The Regex Approach

Regex parsing is the first thing most developers try. Extract the text from a document, then write regular expressions to find the values you need.

Here is a typical regex-based invoice parser in TypeScript:

const extractInvoiceData = (text: string) => {
  const invoiceNumberMatch = text.match(/Invoice\s*#?\s*:?\s*([A-Z0-9-]+)/i);
  const totalMatch = text.match(/Total\s*:?\s*\$?\s*([\d,]+\.?\d*)/i);
  const dateMatch = text.match(/Date\s*:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})/i);

  return {
    invoiceNumber: invoiceNumberMatch?.[1] ?? null,
    total: totalMatch?.[1] ? parseFloat(totalMatch[1].replace(/,/g, "")) : null,
    date: dateMatch?.[1] ?? null,
  };
};

This works for a specific invoice format. If the document says “Invoice #INV-2024-001” and “Total: $1,250.00” and “Date: 02/15/2026”, the regex matches.

Where Regex Breaks

The problems start immediately in production.

Label variation. One vendor writes “Invoice Number”, another writes “Invoice #”, another writes “Rechnungsnummer”. Your regex handles the first two. The third one? You add another pattern. Then another vendor writes “Factura No.” and another writes “Bill ID”. The regex grows into a monster.

Format variation. Dates appear as “02/15/2026”, “15.02.2026”, “February 15, 2026”, “2026-02-15”, and “15-Feb-26”. Currency appears as “$1,250.00”, “1.250,00 EUR”, “USD 1250”, and “1,250”. Each variation needs its own pattern.
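
The date formats alone illustrate the problem. A regex-only date finder needs one pattern per format, and even the sketch below, which matches only the five formats named above, is incomplete: it ignores locale-specific month names, ordinal suffixes, and dozens of regional variants.

```typescript
// One pattern per date format. Every new vendor risks adding another.
const datePatterns = [
  /\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/, // 02/15/2026
  /\b\d{1,2}\.\d{1,2}\.\d{4}\b/, // 15.02.2026
  /\b[A-Z][a-z]+ \d{1,2}, \d{4}\b/, // February 15, 2026
  /\b\d{4}-\d{2}-\d{2}\b/, // 2026-02-15
  /\b\d{1,2}-[A-Z][a-z]{2}-\d{2}\b/, // 15-Feb-26
];

const findDate = (text: string): string | null => {
  for (const pattern of datePatterns) {
    const match = text.match(pattern);
    if (match) return match[0];
  }
  return null;
};
```

And matching is only half the job: each matched string still needs its own parsing logic before it becomes a normalized date.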

Layout changes. The vendor updates their invoice template. The label that was on the same line as the value is now on the line above. Your regex, which assumes “Total: $1,250.00” is on one line, stops matching.

Scanned documents. OCR text is noisy. “Invoice #INV-2024-001” becomes “lnvoice #lNV-2O24-OO1”. The “I” becomes an “l”, the “0” becomes an “O”. Your regex fails silently.
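
A common partial mitigation is to repair likely digit confusions before matching. The sketch below touches only character runs that contain at least one real digit, so purely alphabetic text is left alone; it recovers simple cases and nothing more.

```typescript
// Replace likely OCR digit confusions (O -> 0, l/I/| -> 1), but only inside
// runs that already contain a real digit, so identifiers like "INV" survive.
const normalizeDigitRun = (run: string): string =>
  run.replace(/[OoIl|]/g, (c) => (c === "O" || c === "o" ? "0" : "1"));

const normalizeOcrDigits = (text: string): string =>
  text.replace(/[0-9OoIl|]{3,}/g, (run) =>
    /\d/.test(run) ? normalizeDigitRun(run) : run,
  );

normalizeOcrDigits("lnvoice #lNV-2O24-OO1");
// → "lnvoice #lNV-2024-001"
```

Even this leaves the garbled label "lnvoice" unfixed, which is exactly why a regex that anchors on the label still fails.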

Multi-line values. Addresses, line item descriptions, and notes span multiple lines. Regex patterns are effectively line-oriented (by default the “.” metacharacter does not even match newlines), so extracting multi-line structured data requires increasingly complex patterns that are nearly impossible to maintain.

Regex works for a controlled, single-format pipeline. For anything with variation — which is most real-world document processing — it becomes a maintenance burden that grows with every new document type.

The Template Approach

Template-based parsing adds spatial awareness. Instead of matching text patterns, you define zones on the page where specific data appears.

The concept looks like this:

const invoiceTemplate = {
  invoiceNumber: { page: 1, x: 450, y: 120, width: 200, height: 30 },
  total: { page: 1, x: 400, y: 680, width: 150, height: 30 },
  date: { page: 1, x: 450, y: 150, width: 150, height: 30 },
  lineItems: { page: 1, x: 50, y: 250, width: 500, height: 400 },
};

You map each field to a physical region on the page. The parser extracts text from that region. For structured, consistent documents — like government forms with fixed layouts — this works well.
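
Under the hood, a template engine pairs those zones with word positions from a PDF or OCR layer. A minimal sketch, assuming an upstream step has already produced words with page coordinates (the Word shape and the containment test here are illustrative, not from any particular library):

```typescript
type Word = { text: string; page: number; x: number; y: number };
type Zone = { page: number; x: number; y: number; width: number; height: number };

// Collect the text of every word whose anchor point falls inside the zone.
const extractZone = (words: Word[], zone: Zone): string =>
  words
    .filter(
      (w) =>
        w.page === zone.page &&
        w.x >= zone.x &&
        w.x <= zone.x + zone.width &&
        w.y >= zone.y &&
        w.y <= zone.y + zone.height,
    )
    .map((w) => w.text)
    .join(" ");
```

Every failure mode below follows directly from this containment test: shift the words a few pixels and the filter silently returns the wrong text.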

Where Templates Break

One template per layout. If you process invoices from 50 vendors, you need 50 templates. If a vendor uses different layouts for different invoice types, multiply accordingly. Template libraries become large and expensive to maintain.

Sensitivity to layout changes. A vendor adds a new address line. Everything below shifts down by 20 pixels. Your bounding boxes are now off. The “total” field captures part of a line item instead. You don’t get an error — you get wrong data.

Scan variation. Scanned documents are never perfectly aligned. A slight rotation, a different scan resolution, or a feed misalignment shifts everything by a few pixels. Templates that work on digital PDFs fail on scans of the same document.

No semantic understanding. Templates don’t know what a field means. They know where it is. If a vendor moves the invoice number from the top-right to the top-left, the template extracts nothing — or worse, extracts the wrong value.

Scale ceiling. Template management becomes a dedicated role. Someone needs to create templates for new document types, update templates when layouts change, and debug extraction errors caused by template drift. This is a full-time job for a production system processing diverse documents.

AI-Based Extraction

AI-based document parsing changes the paradigm. Instead of telling the parser where data lives (templates) or what data looks like (regex), you tell it what data means.

You define a schema — a list of fields with names, types, and descriptions. The AI reads the document, understands its content, and maps the content to your schema. It doesn’t care whether the invoice number is at the top or bottom of the page. It doesn’t care whether the label says “Invoice #” or “Rechnungsnummer”. It reads the document like a human would and extracts the data.

This is schema-based extraction. The schema is the interface between you and the parser.

The Document Extraction API

The Document Extraction API implements schema-based extraction. You send documents and a schema. You get structured JSON back. Here is what that looks like.

Defining a Schema

A schema is an array of field definitions. Each field has a name, a type, and a description:

const schema = {
  fields: [
    {
      name: "invoiceNumber",
      type: "TEXT",
      description: "The unique invoice identifier, e.g., INV-2024-001",
      is_required: true,
    },
    {
      name: "vendorName",
      type: "TEXT",
      description: "The name of the company that issued the invoice",
      is_required: true,
    },
    {
      name: "invoiceDate",
      type: "DATE",
      description: "The date the invoice was issued",
      is_required: true,
    },
    {
      name: "dueDate",
      type: "DATE",
      description: "The payment due date",
    },
    {
      name: "totalAmount",
      type: "CURRENCY_AMOUNT",
      description: "The total amount due on the invoice",
      is_required: true,
    },
    {
      name: "currency",
      type: "CURRENCY_CODE",
      description: "The currency of the invoice, e.g., USD, EUR",
      is_required: true,
    },
  ],
};

Sending the Request

import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.extract({
  files: [
    { type: "url", name: "invoice-2024-001.pdf", url: "https://example.com/invoice-2024-001.pdf" },
  ],
  schema,
});

The Response

{
  "success": true,
  "data": [
    {
      "fileName": "invoice-2024-001.pdf",
      "fields": {
        "invoiceNumber": {
          "value": "INV-2024-001",
          "confidence": 0.98
        },
        "vendorName": {
          "value": "Acme Corporation",
          "confidence": 0.95
        },
        "invoiceDate": {
          "value": "2024-11-15",
          "confidence": 0.97
        },
        "dueDate": {
          "value": "2024-12-15",
          "confidence": 0.93
        },
        "totalAmount": {
          "value": 1250.00,
          "confidence": 0.99
        },
        "currency": {
          "value": "USD",
          "confidence": 0.96
        }
      }
    }
  ]
}

Same data, any layout. The regex example from earlier needed specific patterns for “Invoice #”, “Total:”, and “Date:”. This schema works regardless of how those labels appear in the document.

Compare: Regex vs API

The regex approach for an invoice:

const invoiceNumberMatch = text.match(/Invoice\s*#?\s*:?\s*([A-Z0-9-]+)/i);
const totalMatch = text.match(/Total\s*:?\s*\$?\s*([\d,]+\.?\d*)/i);
const dateMatch = text.match(/Date\s*:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})/i);

Three patterns that handle one layout. The API approach:

const schema = {
  fields: [
    { name: "invoiceNumber", type: "TEXT", description: "The unique invoice identifier" },
    { name: "totalAmount", type: "CURRENCY_AMOUNT", description: "Total amount due" },
    { name: "invoiceDate", type: "DATE", description: "Date the invoice was issued" },
  ],
};

Three field definitions that handle any layout. One is code you maintain and debug. The other is a declaration of intent.

Field Types Deep Dive

The API supports 17 field types; the most commonly used ones are covered below. Each type tells the parser what kind of value to expect and how to validate and format it.

Basic Types

TEXT is the most flexible. Use it for any freeform text — names, identifiers, descriptions, notes. The parser extracts the text as-is.

TEXTAREA is for longer freeform text — multi-line descriptions, notes, summaries. Use it when the expected value spans multiple sentences or paragraphs.

INTEGER extracts whole numbers. The parser handles format variation — “1,000”, “1.000”, “1000” all parse to the integer 1000.

DECIMAL handles numbers with fractional parts. Same format normalization as INTEGER, but preserves decimals.

DATE normalizes dates to ISO 8601 format (YYYY-MM-DD). The parser handles “02/15/2026”, “15.02.2026”, “February 15, 2026”, “15-Feb-26”, and other common date formats. No more date-parsing regex.

BOOLEAN extracts yes/no, true/false, checked/unchecked values. Useful for form fields like “Is this a final invoice?” or “Terms accepted?”

ENUM restricts values to a predefined list. Define the allowed values in the field’s description — “Payment method: one of WIRE, CHECK, CREDIT_CARD, ACH”. The parser maps the document’s text to one of your allowed values.
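
For example, a payment method field could look like this, following the field shape used throughout this guide:

```typescript
const paymentMethodField = {
  name: "paymentMethod",
  type: "ENUM",
  description: "Payment method: one of WIRE, CHECK, CREDIT_CARD, ACH",
};
```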

Financial Types

CURRENCY_AMOUNT extracts monetary values. Handles “$1,250.00”, “1.250,00”, “EUR 1250”, and other currency format variations. Returns a normalized number.

CURRENCY_CODE extracts the currency — “USD”, “EUR”, “GBP”. The parser infers the currency from symbols such as “$”, prefixes such as “EUR”, explicit labels, or surrounding context.

IBAN extracts International Bank Account Numbers with format validation. The parser normalizes spacing and validates the structure.

Contact and Location Types

ADDRESS is a structured type. Instead of returning an address as a single string, the parser decomposes it into components — street, city, state/province, postal code, country. This saves you from writing your own address parser.
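
The decomposed value might be typed like this. The exact component names here are an assumption for illustration; the API's field type reference has the authoritative shape.

```typescript
// Assumed component names, for illustration only.
type AddressValue = {
  street: string | null;
  city: string | null;
  state: string | null;
  postalCode: string | null;
  country: string | null;
};

const vendorAddress: AddressValue = {
  street: "123 Main St",
  city: "Springfield",
  state: "IL",
  postalCode: "62701",
  country: "US",
};
```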

EMAIL extracts email addresses with format validation.

COUNTRY normalizes country names to standard codes. “United States”, “USA”, “US”, and “United States of America” all map to a consistent value.

Complex Types

ARRAY extracts repeating items — line items on an invoice, skills on a resume, experience entries on a CV, clauses in a contract. Each item in the array has its own sub-schema:

{
  name: "lineItems",
  type: "ARRAY",
  description: "Individual items billed on the invoice",
  item_schema: {
    fields: [
      {
        name: "description",
        type: "TEXT",
        description: "Description of the item or service",
      },
      {
        name: "quantity",
        type: "INTEGER",
        description: "Number of units",
      },
      {
        name: "unitPrice",
        type: "CURRENCY_AMOUNT",
        description: "Price per unit",
      },
      {
        name: "amount",
        type: "CURRENCY_AMOUNT",
        description: "Total amount for this line item",
      },
    ],
  },
}

The parser identifies the repeating structure in the document — a table, a list, a series of blocks — and extracts each item with its sub-fields.

CALCULATED fields derive values from other extracted fields. Define a formula in the description:

{
  name: "taxAmount",
  type: "CALCULATED",
  description: "Calculated as totalAmount - subtotal",
}

The parser computes the value from the other fields it has extracted. This is useful for validation — if the calculated value doesn’t match what appears in the document, you know something is off.
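
That cross-check can also be run explicitly on your side. A sketch, using the field names and response shape from the examples in this guide (the one-cent tolerance is an illustrative choice):

```typescript
type AmountField = { value: number };

// Does the extracted tax match the difference between total and subtotal?
const taxIsConsistent = (fields: {
  totalAmount: AmountField;
  subtotal: AmountField;
  taxAmount: AmountField;
}): boolean => {
  const expected = fields.totalAmount.value - fields.subtotal.value;
  return Math.abs(expected - fields.taxAmount.value) < 0.01;
};
```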

Confidence Scores and Human-in-the-Loop

Every extracted field comes with a confidence score between 0.0 and 1.0. This score tells you how confident the parser is in the extracted value.

High confidence (0.9+) means the parser is very sure. The text was clear, the field type matched, and there was no ambiguity. Low confidence (below 0.7) means the parser found something but isn’t certain — maybe the text was partially occluded, maybe the OCR was noisy, maybe there were multiple candidate values.

Confidence scores enable human-in-the-loop workflows. Instead of reviewing every extracted document, you set a threshold. Fields above the threshold pass through automatically. Fields below the threshold get flagged for human review.

const CONFIDENCE_THRESHOLD = 0.85;

const flaggedFields = Object.entries(result.fields)
  .filter(([, field]) => field.confidence < CONFIDENCE_THRESHOLD)
  .map(([name, field]) => ({
    fieldName: name,
    extractedValue: field.value,
    confidence: field.confidence,
  }));

if (flaggedFields.length > 0) {
  await sendToReviewQueue(result.fileName, flaggedFields);
} else {
  await processAutomatically(result);
}

This gives you the best of both worlds. Most documents process fully automatically. The edge cases — scanned copies, unusual layouts, ambiguous values — get human attention. Your throughput goes up, your error rate goes down, and your reviewers focus on the cases that actually need human judgment.

Tuning the Threshold

The right threshold depends on your use case and tolerance for errors.

Financial data (invoices, payments): Set it high — 0.90 or above. A wrong amount or a wrong account number has real consequences. Better to flag more documents for review than to process bad data.

Informational data (content metadata, article parsing): Set it lower — 0.75 to 0.85. A slightly wrong author name or publication date is less costly. You want throughput.

Compliance data (contracts, legal documents): Set it very high — 0.95+. Misextracted clauses or dates in legal documents create liability. Flag aggressively.

Start conservative and relax the threshold as you gain confidence in the extraction quality for your specific document types.
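
Expressed as configuration, the guidance above might look like this (the document-type keys are illustrative):

```typescript
const CONFIDENCE_THRESHOLDS: Record<string, number> = {
  invoice: 0.9, // financial: flag aggressively
  article: 0.8, // informational: favor throughput
  contract: 0.95, // compliance: flag very aggressively
};

// Fall back to a conservative default for unknown document types.
const thresholdFor = (docType: string): number =>
  CONFIDENCE_THRESHOLDS[docType] ?? 0.9;
```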

Batch Processing

The API accepts up to 20 files per request, with a maximum of 50MB per file and 200MB total per request. This enables batch processing without multiple API calls.

const result = await client.extract({
  files: invoiceFiles.map((url, index) => ({
    type: "url",
    name: `invoice-${index}.pdf`,
    url,
  })),
  schema,
});
// result.data is an array — one entry per file

The response contains one entry per file, each with its own extracted fields and confidence scores. Files that fail — corrupt PDFs, unsupported formats — return errors in their entry without failing the entire batch.

For larger volumes, send batches sequentially or in parallel. A batch of 20 invoices with the same schema processes as a single request. A thousand invoices become 50 requests.
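
Splitting the file list into request-sized chunks takes a few lines. Each chunk then goes through client.extract exactly as in the batch example above (the 20-file limit is the one stated at the start of this section):

```typescript
// Split a list into batches no larger than the per-request file limit.
const chunk = <T>(items: T[], size: number): T[][] => {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
};

// A thousand invoice URLs become 50 batches of 20:
// for (const batch of chunk(allUrls, 20)) { /* await client.extract(...) */ }
```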

Built-in OCR

Scanned documents — photos of receipts, scanned contracts, faxed invoices — need OCR before the text can be parsed. The API handles this automatically. If a document is a scanned image or an image-based PDF, the built-in OCR extracts the text, and the parser works on the OCR output.

You don’t need to preprocess scans. You don’t need a separate OCR step. Send the scanned document the same way you send a digital PDF. The API detects that it needs OCR and applies it.

This removes an entire layer of infrastructure. No Tesseract installation. No OCR preprocessing pipeline. No handling of OCR confidence levels separately from extraction confidence. The end-to-end confidence score already accounts for OCR quality.

Building a Complete Pipeline

Here is a complete document parsing pipeline — schema definition, batch extraction, confidence filtering, and result handling:

const schema = {
  fields: [
    {
      name: "invoiceNumber",
      type: "TEXT",
      description: "The unique invoice identifier",
      is_required: true,
    },
    {
      name: "vendorName",
      type: "TEXT",
      description: "The name of the company that issued the invoice",
      is_required: true,
    },
    {
      name: "invoiceDate",
      type: "DATE",
      description: "The date the invoice was issued",
      is_required: true,
    },
    {
      name: "totalAmount",
      type: "CURRENCY_AMOUNT",
      description: "The total amount due",
      is_required: true,
    },
    {
      name: "currency",
      type: "CURRENCY_CODE",
      description: "The invoice currency",
      is_required: true,
    },
    {
      name: "vendorIban",
      type: "IBAN",
      description: "The vendor's bank account IBAN",
    },
    {
      name: "vendorAddress",
      type: "ADDRESS",
      description: "The vendor's billing address",
    },
    {
      name: "lineItems",
      type: "ARRAY",
      description: "Individual items on the invoice",
      item_schema: {
        fields: [
          { name: "description", type: "TEXT", description: "Item description" },
          { name: "quantity", type: "INTEGER", description: "Number of units" },
          { name: "unitPrice", type: "CURRENCY_AMOUNT", description: "Price per unit" },
          { name: "amount", type: "CURRENCY_AMOUNT", description: "Line total" },
        ],
      },
    },
    {
      name: "subtotal",
      type: "CURRENCY_AMOUNT",
      description: "Sum of all line item amounts before tax",
    },
    {
      name: "taxAmount",
      type: "CALCULATED",
      description: "Calculated as totalAmount - subtotal",
    },
  ],
};

const CONFIDENCE_THRESHOLD = 0.90;

const processInvoiceBatch = async (fileUrls: string[]) => {
  const { data: results } = await client.extract({
    files: fileUrls.map((url, index) => ({
      type: "url",
      name: `invoice-${index}.pdf`,
      url,
    })),
    schema,
  });

  for (const result of results) {
    const lowConfidenceFields = Object.entries(result.fields)
      .filter(([, field]) => field.confidence < CONFIDENCE_THRESHOLD);

    if (lowConfidenceFields.length > 0) {
      console.log(`Flagging ${result.fileName} for review`);
      // Route to human review queue
    } else {
      console.log(`Auto-processing ${result.fileName}`);
      // Insert into accounting system
    }
  }
};

This handles digital PDFs and scanned documents alike. The schema covers basic fields, financial fields (CURRENCY_AMOUNT, CURRENCY_CODE, IBAN), structured fields (ADDRESS, ARRAY), and validation (CALCULATED). The confidence threshold routes uncertain results to human review.

When to Use Which Approach

Choosing the right approach depends on three factors: document variation, volume, and accuracy requirements.

Use regex when:

  • You have a single document format that never changes
  • The text extraction is clean (digital PDFs, not scans)
  • You need maximum speed and zero external dependencies
  • You’re parsing structured text (logs, CSVs) rather than visual documents

Regex is a good fit for internal tool outputs, API responses, and machine-generated reports. It is a bad fit for vendor documents, customer uploads, or anything where you don’t control the format.

Use templates when:

  • You have a small number of fixed-layout document types (under 10)
  • The documents are always digital, never scanned
  • The layout genuinely never changes (government forms with mandated layouts)
  • You need zone-level extraction (specific regions of a page)

Templates work for highly controlled environments — a specific government form, a specific bank statement format. They break down when document types multiply or layouts drift.

Use AI-based extraction when:

  • You process documents from multiple sources with varying layouts
  • Documents include scans, photos, or mixed-quality inputs
  • You need to handle multiple languages
  • The document types evolve over time (new vendors, updated forms)
  • You want to define what to extract, not where to find it
  • You need confidence scores for quality control

AI-based extraction handles the real world — diverse documents, inconsistent quality, evolving formats. The tradeoff is that it requires a well-designed schema and a strategy for handling low-confidence results.

Decision Matrix

Factor | Regex | Templates | AI Extraction
Setup time | Minutes | Hours per template | Minutes per schema
Document variation | One format | Few fixed formats | Any format
Scanned documents | Poor | Poor | Built-in OCR
Multiple languages | Manual per language | N/A | Automatic
Maintenance | High (grows with formats) | High (grows with templates) | Low (schema is stable)
Accuracy on known formats | High | High | High
Accuracy on unknown formats | Fails | Fails | Degrades gracefully
Cost per document | Free (compute only) | Free (compute only) | API pricing

Schema Design Best Practices

The quality of AI-based extraction depends on the quality of your schema. A well-designed schema produces better results than a vague one.

Be specific in descriptions. “The invoice number” is okay. “The unique invoice identifier, usually formatted as INV-YYYY-NNN, found near the top of the document” is better. Descriptions help the parser disambiguate when multiple candidate values exist.

Use the right field type. Don’t use TEXT for a date. Don’t use DECIMAL for a currency amount. Field types carry semantic meaning — CURRENCY_AMOUNT tells the parser to handle format variation (commas vs. periods, currency symbols, thousands separators) that TEXT would miss.

Mark required fields. The is_required flag tells the parser that a missing value is an error, not an expected absence. Use it for fields that should always be present. Don’t use it for fields that are genuinely optional (like a PO number on invoices that don’t always have one).

Use CALCULATED for validation. If your document contains a subtotal and a total, extract both and add a CALCULATED field for the tax. If the calculated tax doesn’t match the document’s tax line, the confidence score reflects the inconsistency. This is a built-in sanity check.

Design ARRAY sub-schemas carefully. Line items on invoices, experience entries on resumes, clauses in contracts — these are all arrays. The sub-schema should cover the fields that are consistently present in each item. Don’t include fields that only appear in some items as required.

What’s Next

Document parsing has moved from brittle pattern matching to flexible, schema-driven extraction. The tools are available today. If you’re processing documents at any scale — invoices, receipts, contracts, resumes, forms — the approach matters more than the tool.

Start with your most common document type. Define a schema that covers the fields your system needs. Send a batch through the Document Extraction API and check the confidence scores. Tune the schema descriptions. Set a confidence threshold. Build the review workflow for edge cases.

The API documentation has the full field type reference, request format details, and response examples. Try it with your actual documents — that is the fastest way to see what schema-based extraction can do for your pipeline.
