Regex and Templates Break. Here's What to Use Instead for Document Parsing

7 min read · Document Extraction

The Regex Graveyard

Every document parsing project starts the same way. You open a PDF, you find the field you need, you write a regex. It works. You test it on a second document — same vendor, same layout — and it works again. You ship it.

Then the vendor updates their invoice template. Or a new vendor sends documents with a different structure. Or someone uploads a scanned copy instead of a digital PDF. Your regex, which was never robust in the first place, starts returning garbage. Silently.

So you add more regex. Special cases for vendor A, vendor B, vendor C. Exception handling for scans. Fallback patterns for the edge cases. Six months later you have 2,000 lines of regex that nobody wants to touch, and it still breaks on every tenth document.

Templates Aren’t Much Better

Template-based parsers solve the layout problem — sort of. You define bounding boxes or anchor points for each document format. “The invoice number is always at coordinates (x: 450, y: 120).” As long as every document matches the template, it works.

But documents don’t stay still. A vendor reformats their invoices. A government agency releases a new version of a form. A scanned document comes in slightly rotated. The bounding boxes are off by 20 pixels and the extraction fails.

Templates also don’t scale. If you process documents from 50 different vendors, you need 50 templates. If a vendor has two invoice formats (one for domestic, one for international), that’s two more templates. Maintaining a template library becomes a full-time job.

The Core Problem

Regex and templates both make the same assumption: the document layout is predictable. For invoices from one vendor, maybe. For receipts from one POS system, maybe. For any kind of scale or variation — no.

What you actually want is extraction that adapts to the document. You describe what data you need, and the parser figures out where it lives — regardless of format, layout, or vendor.

Schema-Based Extraction

The Document Extraction API flips the approach. Instead of describing where data is, you describe what it is.

A schema is an object containing an array of field definitions:

const schema = {
  fields: [
    {
      name: "invoice_number",
      type: "TEXT",
      description: "The unique invoice identifier",
      is_required: true,
    },
    {
      name: "total_amount",
      type: "CURRENCY_AMOUNT",
      description: "Total amount due on the invoice",
      is_required: true,
    },
    {
      name: "due_date",
      type: "DATE",
      description: "Payment due date",
    },
  ],
};

That schema works for any invoice layout. The vendor can put the invoice number at the top, bottom, or middle of the page. The total can be labeled “Total,” “Amount Due,” “Summe,” or “Montant.” The parser reads the document and finds the data that matches your field descriptions.

No coordinates. No regex. No vendor-specific rules.

Why This Works

Schema-based extraction uses AI to understand document content — not just character positions. The parser reads the document the way a human would: it sees the text, understands the context, and identifies which parts match your field descriptions.

This handles:

  • Layout variation — fields can appear anywhere on the page
  • Label variation — “Total Due,” “Amount Payable,” “Gesamtbetrag” all match a field described as “total amount”
  • Format mixing — digital PDFs, scanned documents, and photos all go through the same pipeline
  • Multi-format documents — the same schema extracts from PDF, DOCX, and image files

Confidence Scores Replace Silent Failures

The biggest problem with regex parsers isn’t that they fail — it’s that they fail silently. A regex that matches the wrong text still returns a result. You don’t know the extraction was wrong until a human catches it downstream (or doesn’t).

The Document Extraction API returns a confidence score between 0.0 and 1.0 for every extracted field. You know immediately how much to trust each value. Build your pipeline around thresholds: auto-accept above 0.90, flag for review between 0.70 and 0.90, reject below 0.70.
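A minimal sketch of that threshold routing (the field shape and the 0.90/0.70 cutoffs are illustrative, not the documented API; tune them on your own documents):

```javascript
// Route an extracted field by its confidence score.
function routeField(field) {
  if (field.confidence >= 0.9) return "accept"; // auto-accept
  if (field.confidence >= 0.7) return "review"; // flag for human review
  return "reject";
}
```

In practice the "review" bucket feeds a human-in-the-loop queue, so only the uncertain fraction of documents costs manual effort.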

17 Field Types vs. Text Regex

A regex parser gives you text. What the text means — is it a number? a date? an address? — that’s your problem to figure out.

The Document Extraction API has 17 purpose-built field types. When you define a field as CURRENCY_AMOUNT, you get a numeric value with proper decimal handling. When you define ADDRESS, you get a decomposed object with street, city, region, postal code, and country. When you define IBAN, you get a validated IBAN string, not just whatever matched a pattern.

This pushes validation and normalization into the extraction step, where it belongs. Your application code receives clean, typed data instead of raw strings that need post-processing.
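Here is what typed output means in practice. The object shapes below are illustrative only, not the documented response format:

```javascript
// CURRENCY_AMOUNT arrives as a number, so arithmetic needs no string parsing.
const totalField = { value: 1234.56, confidence: 0.97 };
const amountWithVat = totalField.value * 1.19;

// ADDRESS arrives decomposed, so individual parts are directly addressable.
const addressField = {
  value: { street: "Hauptstr. 1", city: "Berlin", postal_code: "10115", country: "DE" },
  confidence: 0.91,
};
const region = addressField.value.city;
```

Compare that to a regex pipeline, where every one of these values starts life as a raw string you still have to parse, validate, and normalize.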

Side-by-Side: Regex vs. Schema

Here’s a concrete comparison. Suppose you need to extract an invoice total from a document.

The regex approach:

// Works for "Total: $1,234.56"
const TOTAL_REGEX = /Total:\s*\$?([\d,]+\.\d{2})/;
const match = documentText.match(TOTAL_REGEX);
const total = match ? parseFloat(match[1].replace(/,/g, "")) : null;

That regex handles one label format (“Total:”), one currency symbol (“$”), and one decimal convention (period). A European invoice with “Gesamtbetrag: 1.234,56 EUR” breaks it. An invoice that says “Amount Due” instead of “Total” breaks it. A scanned PDF where OCR misreads a character breaks it. Each edge case is another regex, another branch, another test.

The schema approach:

{
  name: "total_amount",
  type: "CURRENCY_AMOUNT",
  description: "Total amount due on the invoice",
  is_required: true,
}

One field definition. It handles “Total”, “Amount Due”, “Gesamtbetrag”, and “Montant total”. It handles periods and commas as decimal separators. It handles scanned documents via built-in OCR. And it returns a confidence score telling you how reliable the extraction is.

Migrating from Regex to Schema-Based Extraction

If you already have a regex-based parser, migrating is straightforward. Take your existing regex patterns and translate each one into a field definition.

Start by listing every field your regex extracts. For each field, define a schema entry with the appropriate type. A regex that captures a date becomes a DATE field. A regex that captures a number becomes CURRENCY_AMOUNT, INTEGER, or DECIMAL depending on the context. A regex that captures an address becomes an ADDRESS field.
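That translation can be mostly mechanical. A sketch, assuming you keep your old patterns in a table (the names and patterns here are hypothetical):

```javascript
// Hypothetical starting point: the regex patterns you already maintain.
const regexFields = [
  { name: "due_date", pattern: /Due:\s*(\d{4}-\d{2}-\d{2})/, type: "DATE" },
  { name: "total_amount", pattern: /Total:\s*\$?([\d,.]+)/, type: "CURRENCY_AMOUNT" },
];

// Each pattern becomes a field definition; the pattern itself is dropped,
// because the schema describes WHAT the field is, not where it matches.
const migratedSchema = {
  fields: regexFields.map(({ name, type }) => ({
    name,
    type,
    description: `Value previously captured by the ${name} regex`,
  })),
};
```

Write the descriptions by hand rather than generating them: a precise description ("payment due date, not the invoice issue date") is what guides the extraction.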

Run both parsers in parallel for a validation period. Send the same documents through both and compare the results. The schema-based parser will match your regex results on clean documents and outperform them on messy ones. Once you’re satisfied with the comparison, remove the regex code.
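A minimal comparison harness for that parallel period might look like this (the result maps come from your own wrappers around each parser; the names are placeholders, not part of any SDK):

```javascript
// Compare the two parsers field by field during the validation period.
// regexResult and schemaResult are plain { fieldName: value } maps.
function compareResults(regexResult, schemaResult, fieldNames) {
  const report = {};
  for (const name of fieldNames) {
    const regexValue = regexResult[name] ?? null;
    const schemaValue = schemaResult[name] ?? null;
    report[name] = {
      regex: regexValue,
      schema: schemaValue,
      match: regexValue !== null && regexValue === schemaValue,
    };
  }
  return report;
}
```

Log the mismatches and inspect them: most will be documents where the regex returned nothing (or the wrong text) and the schema parser found the value.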

The migration doesn’t need to happen all at once. You can move one document type at a time — start with the vendor whose invoices break your regex most often.

When Regex Still Makes Sense

Regex isn’t always wrong. If you control the document format (you generate it yourself), the layout never changes, and you’re extracting one or two simple fields — regex is fine. It’s fast, it’s deterministic, and it has zero cost per extraction.

But if you’re processing documents from external sources, with any layout variation, or extracting more than a handful of fields — schema-based extraction saves you from the maintenance spiral.

Get Started

Check the docs for the full field type reference and schema definition guide. The TypeScript and Python SDKs handle the API integration, so you can replace your regex parser with a few lines of code.

And because Document Extraction is part of a composable API suite, the structured data it returns flows directly into Document Generation or Image Generation — same auth, same credit pool, no glue code.

Iteration Layer runs on EU infrastructure (Frankfurt), which matters if your data residency requirements rule out US-hosted services.

Sign up for a free account — no credit card required. Try your current document corpus against a schema and compare the results to what your regex produces.
