Extracting Structured Data from Scanned Documents: OCR Plus Field Validation

The Filing Cabinet Problem

Every organization has one. A storage room, a shared drive, a Dropbox folder — somewhere there are thousands of documents that exist only as scans. Supplier invoices from before the accounting system went digital. Patient intake forms from a decade of paper processes. Lease agreements that were faxed, signed, scanned, and filed away. Customs declarations. Insurance claims. Building permits.

The data inside those documents is valuable. It is also trapped behind a wall of pixels. A scanned PDF is not a document in any meaningful sense — it is a photograph of a document, wrapped in a PDF container. You cannot search it. You cannot copy text from it. You cannot query a database for “all invoices over EUR 10,000 from 2023” when those invoices are flat images.

The traditional fix is OCR — optical character recognition. Run Tesseract, get text out. But raw OCR gives you a stream of characters with no structure. An invoice number, a date, an address, and a line item table all come back as one unstructured blob. You still need to write parsers to separate the fields, regex to validate the formats, and error handling for the dozens of ways scanned documents degrade — skewed scans, coffee stains, faded ink, low-resolution mobile camera captures.

That is two problems, not one. OCR converts pixels to characters. Extraction converts characters to structured data. Most tools solve the first and leave the second to you.

Scanned vs. Digital: Two Kinds of PDF, Same Extraction Problem

Before diving into the approach, it helps to understand what you are dealing with. PDFs come in two fundamentally different varieties, and a third that combines the worst of both.

Digital PDFs are born digital. Someone typed a document in Word, generated it from an application, or exported it from a database. The text inside is real text — selectable, searchable, stored as character codes. These are the easy case. You can extract text without OCR.

Scanned PDFs are images inside a PDF wrapper. A physical document was placed on a scanner or photographed with a phone. The PDF contains one image per page, and that image contains text, but the PDF file itself has no idea what that text says. These require OCR before anything else can happen.

Hybrid PDFs combine both. A common example: a digitally generated contract where the signature pages were printed, signed by hand, scanned, and appended. Some pages have real text. Others are images. The worst case is a scanned document that went through a poor OCR pass years ago: it has a text layer, but that layer is full of errors, and the image underneath is the only reliable source.

The Iteration Layer’s Document Extraction API handles all three. For digital PDFs, it reads the text layer directly. For scanned PDFs, it runs OCR automatically. For hybrids, it detects which pages need OCR and which do not. You send the file and a schema. The API figures out the rest.
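How the API makes that per-page decision is not described here. As a rough illustration of the idea only, the sketch below classifies a document from the text already read out of each page's text layer: pages with almost no extractable text are treated as scans. The function names and the 20-character threshold are assumptions made for this example, and the heuristic does not catch the case of a text layer that exists but is full of OCR errors.

```python
from typing import List

# Assumed threshold: pages whose text layer has fewer characters than this
# are treated as image-only. Tune it against your own documents.
MIN_TEXT_CHARS = 20


def pages_needing_ocr(page_texts: List[str]) -> List[int]:
    """Return zero-based indexes of pages with (almost) no text layer."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < MIN_TEXT_CHARS
    ]


def classify_pdf(page_texts: List[str]) -> str:
    """Label a document as 'digital', 'scanned', or 'hybrid'."""
    ocr_pages = pages_needing_ocr(page_texts)
    if not ocr_pages:
        return "digital"
    if len(ocr_pages) == len(page_texts):
        return "scanned"
    return "hybrid"


# A contract with two typed pages and one scanned signature page.
contract_pages = [
    "This Lease Agreement is entered into by the parties named below.",
    "Section 2. The term of this lease begins on the commencement date.",
    "",
]
print(classify_pdf(contract_pages))
```

In practice the per-page strings would come from whatever PDF library you already use to read the text layer.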

Schema-Based Extraction: Describe What You Want, Not Where It Is

The key idea is that you define a schema — a list of fields with types and descriptions — and the API extracts values that match. You do not tell the parser where on the page to look. You do not write templates for each document layout. You describe the data you want, and the parser finds it.

Here is a straightforward example: extracting key fields from a scanned invoice.

Request (curl)

curl -X POST \
  https://api.iterationlayer.com/document-extraction/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [
      {
        "type": "url",
        "name": "invoice-scan.pdf",
        "url": "https://example.com/scans/invoice-scan.pdf"
      }
    ],
    "schema": {
      "fields": [
        {
          "name": "invoice_number",
          "type": "TEXT",
          "description": "Invoice or document reference number",
          "is_required": true
        },
        {
          "name": "invoice_date",
          "type": "DATE",
          "description": "Date the invoice was issued"
        },
        {
          "name": "vendor_name",
          "type": "TEXT",
          "description": "Name of the company that issued the invoice"
        },
        {
          "name": "vendor_address",
          "type": "ADDRESS",
          "description": "Address of the invoicing company"
        },
        {
          "name": "total_amount",
          "type": "CURRENCY_AMOUNT",
          "description": "Total amount due including tax"
        },
        {
          "name": "currency",
          "type": "CURRENCY_CODE",
          "description": "Currency of the total amount"
        },
        {
          "name": "iban",
          "type": "IBAN",
          "description": "Bank account IBAN for payment"
        }
      ]
    }
  }'

Response

{
  "success": true,
  "data": {
    "invoice_number": {
      "type": "TEXT",
      "value": "INV-2024-03871",
      "confidence": 0.96,
      "citations": ["INV-2024-03871"],
      "source": "invoice-scan.pdf"
    },
    "invoice_date": {
      "type": "DATE",
      "value": "2024-11-08",
      "confidence": 0.93,
      "citations": ["08.11.2024"],
      "source": "invoice-scan.pdf"
    },
    "vendor_name": {
      "type": "TEXT",
      "value": "Schneider Industriebedarf GmbH",
      "confidence": 0.95,
      "citations": ["Schneider Industriebedarf GmbH"],
      "source": "invoice-scan.pdf"
    },
    "vendor_address": {
      "type": "ADDRESS",
      "value": {
        "street": "Industriestraße 12",
        "city": "Stuttgart",
        "region": "Baden-Württemberg",
        "postal_code": "70469",
        "country": "DE"
      },
      "confidence": 0.91,
      "citations": ["Industriestraße 12, 70469 Stuttgart"],
      "source": "invoice-scan.pdf"
    },
    "total_amount": {
      "type": "CURRENCY_AMOUNT",
      "value": 4283.50,
      "confidence": 0.94,
      "citations": ["Gesamtbetrag: EUR 4.283,50"],
      "source": "invoice-scan.pdf"
    },
    "currency": {
      "type": "CURRENCY_CODE",
      "value": "EUR",
      "confidence": 0.97,
      "citations": ["EUR 4.283,50"],
      "source": "invoice-scan.pdf"
    },
    "iban": {
      "type": "IBAN",
      "value": "DE89370400440532013000",
      "confidence": 0.88,
      "citations": ["DE89 3704 0044 0532 0130 00"],
      "source": "invoice-scan.pdf"
    }
  }
}

Request (TypeScript)

import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({
  apiKey: "YOUR_API_KEY",
});

const result = await client.extractDocument({
  files: [
    {
      type: "url",
      name: "invoice-scan.pdf",
      url: "https://example.com/scans/invoice-scan.pdf",
    },
  ],
  schema: {
    fields: [
      {
        name: "invoice_number",
        type: "TEXT",
        description: "Invoice or document reference number",
        is_required: true,
      },
      {
        name: "invoice_date",
        type: "DATE",
        description: "Date the invoice was issued",
      },
      {
        name: "vendor_name",
        type: "TEXT",
        description:
          "Name of the company that issued the invoice",
      },
      {
        name: "vendor_address",
        type: "ADDRESS",
        description: "Address of the invoicing company",
      },
      {
        name: "total_amount",
        type: "CURRENCY_AMOUNT",
        description: "Total amount due including tax",
      },
      {
        name: "currency",
        type: "CURRENCY_CODE",
        description: "Currency of the total amount",
      },
      {
        name: "iban",
        type: "IBAN",
        description: "Bank account IBAN for payment",
      },
    ],
  },
});

Request (Python)

from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

result = client.extract_document(
    files=[
        {
            "type": "url",
            "name": "invoice-scan.pdf",
            "url": "https://example.com/scans/invoice-scan.pdf",
        }
    ],
    schema={
        "fields": [
            {
                "name": "invoice_number",
                "type": "TEXT",
                "description": "Invoice or document reference number",
                "is_required": True,
            },
            {
                "name": "invoice_date",
                "type": "DATE",
                "description": "Date the invoice was issued",
            },
            {
                "name": "vendor_name",
                "type": "TEXT",
                "description": "Name of the company that issued the invoice",
            },
            {
                "name": "vendor_address",
                "type": "ADDRESS",
                "description": "Address of the invoicing company",
            },
            {
                "name": "total_amount",
                "type": "CURRENCY_AMOUNT",
                "description": "Total amount due including tax",
            },
            {
                "name": "currency",
                "type": "CURRENCY_CODE",
                "description": "Currency of the total amount",
            },
            {
                "name": "iban",
                "type": "IBAN",
                "description": "Bank account IBAN for payment",
            },
        ]
    },
)

Request (Go)

package main

import (
    "fmt"
    "log"

    il "github.com/iterationlayer/sdk-go"
)

func main() {
    client := il.NewClient("YOUR_API_KEY")

    result, err := client.ExtractDocument(il.ExtractDocumentRequest{
        Files: []il.FileInput{
            il.NewFileFromURL(
                "invoice-scan.pdf",
                "https://example.com/scans/invoice-scan.pdf",
            ),
        },
        Schema: il.ExtractionSchema{
            "invoice_number": il.NewTextFieldConfig(
                "invoice_number",
                "Invoice or document reference number",
            ),
            "invoice_date": il.NewDateFieldConfig(
                "invoice_date",
                "Date the invoice was issued",
            ),
            "vendor_name": il.NewTextFieldConfig(
                "vendor_name",
                "Name of the company that issued the invoice",
            ),
            "vendor_address": il.NewAddressFieldConfig(
                "vendor_address",
                "Address of the invoicing company",
            ),
            "total_amount": il.NewCurrencyAmountFieldConfig(
                "total_amount",
                "Total amount due including tax",
            ),
            "currency": il.NewCurrencyCodeFieldConfig(
                "currency",
                "Currency of the total amount",
            ),
            "iban": il.NewIbanFieldConfig(
                "iban",
                "Bank account IBAN for payment",
            ),
        },
    })
    if err != nil {
        // Stop on transport or API errors before touching the result.
        log.Fatal(err)
    }

    // Print the raw result for inspection.
    fmt.Printf("%+v\n", result)
}

Seven fields, one API call. The ADDRESS field decomposes automatically into street, city, region, postal code, and country. The CURRENCY_CODE returns an ISO 4217 code. The IBAN is validated as a proper IBAN, not just extracted as a string.

Notice the IBAN confidence score: 0.88. Lower than the other fields. That is the parser telling you: “I found something that looks like an IBAN, but I am less certain.” Maybe the scan was slightly blurred in that region. Maybe the digits were partially obscured. The confidence score lets you decide whether to accept or flag it.

Confidence Scores: The Critical Piece for Production Use

Every extracted field includes a confidence score between 0.0 and 1.0. This is not a nice-to-have. It is the difference between a prototype and a production system.

A clean digital PDF with crisp text and clear formatting will usually score high, 0.90 and above across the board. A scanned document from a 1990s fax machine with a coffee stain across the header will score lower. A handwritten form photographed at an angle under fluorescent lighting will score lower still.

Your code should use these scores to build a routing system:

High confidence (0.90 and above): Route straight to your database or downstream process. The extraction is reliable enough for automated handling.
Medium confidence (0.70 up to 0.90): Flag for quick human review. Show the extracted value alongside the citation text so the reviewer can confirm or correct with minimal effort.
Low confidence (below 0.70): Route to manual data entry. The scan quality or document layout made extraction unreliable. A human needs to look at the original.

This three-tier approach is how operations teams process thousands of documents without hiring dozens of data entry staff. The API handles the clear cases automatically. Humans handle the ambiguous cases. Nobody wastes time on documents the machine already got right.
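The three tiers can be sketched as a small routing function over the response shape shown earlier. The thresholds come from the tiers above; the function names and the choice to route a whole document by its least confident field are assumptions made for this example, not something the API prescribes.

```python
from typing import Any, Dict

AUTO_THRESHOLD = 0.90    # at or above: process automatically
REVIEW_THRESHOLD = 0.70  # at or above (but below AUTO): quick human review


def route_field(confidence: float) -> str:
    """Map one field's confidence score to a handling tier."""
    if confidence >= AUTO_THRESHOLD:
        return "auto"
    if confidence >= REVIEW_THRESHOLD:
        return "review"
    return "manual"


def route_document(data: Dict[str, Dict[str, Any]]) -> str:
    """Route the whole document by its least confident field (an assumption)."""
    lowest = min(field["confidence"] for field in data.values())
    return route_field(lowest)


# Trimmed copy of the "data" object from the invoice response.
invoice_data = {
    "invoice_number": {"value": "INV-2024-03871", "confidence": 0.96},
    "total_amount": {"value": 4283.50, "confidence": 0.94},
    "iban": {"value": "DE89370400440532013000", "confidence": 0.88},
}

for name, field in invoice_data.items():
    print(name, route_field(field["confidence"]))
print("document:", route_document(invoice_data))
```

Routing per field instead of per document is equally reasonable: a reviewer then only confirms the IBAN while the other fields flow through untouched.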

Field Types as Built-In Validation

Raw OCR gives you strings. A date extracted by OCR is just text — “08.11.2024” or “November 8, 2024” or “11/08/2024” depending on the document. Your code has to parse all of those formats, handle ambiguity (is “01/02/2024” January 2nd or February 1st?), and validate the result.

Typed field extraction handles this at the extraction layer. When you define a field as DATE, the parser recognizes date formats in context, normalizes to ISO 8601 (2024-11-08), and uses surrounding context to resolve ambiguity. A German invoice with “08.11.2024” returns 2024-11-08, not 2024-08-11.
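The ambiguity is easy to reproduce with the Python standard library. The sketch below parses the same string under a day-first and a month-first reading; which one is right depends on context that the string alone does not carry. It illustrates the problem only and makes no claim about how the API resolves it; the function name is hypothetical.

```python
from datetime import datetime
from typing import Dict


def candidate_dates(raw: str) -> Dict[str, str]:
    """Return every valid ISO 8601 reading of a slash-separated date."""
    formats = {"day_first": "%d/%m/%Y", "month_first": "%m/%d/%Y"}
    results = {}
    for label, fmt in formats.items():
        try:
            results[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            # This reading is impossible (for example, month 22), so skip it.
            continue
    return results


print(candidate_dates("01/02/2024"))  # two valid readings
print(candidate_dates("22/10/2024"))  # only the day-first reading survives
```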

The same applies to every field type:

CURRENCY_AMOUNT extracts a numeric value from text like “EUR 4.283,50” or “$4,283.50” — handling comma-vs-period decimal separators automatically based on context.
IBAN validates the structure and checksum. A string that looks like an IBAN but has an invalid checksum will still be extracted, but with a lower confidence score.
ADDRESS decomposes into components (street, city, region, postal code, country) rather than returning a single string. An address from a German document returns "country": "DE", not "country": "Germany" or "country": "Deutschland".
CURRENCY_CODE returns an ISO 4217 code. The parser maps “Euro”, “EUR”, and the euro symbol to "EUR".
COUNTRY returns an ISO 3166-1 alpha-2 code. “Germany”, “Deutschland”, “DE”, “DEU” all normalize to "DE".
BOOLEAN interprets checkboxes, yes/no fields, and similar binary indicators.
EMAIL validates the extracted value against email format rules.

This means the API does double duty: extraction and validation in one step. You do not need a separate validation layer to check that the IBAN is structurally valid, the date is plausible, or the currency code is a real ISO code.
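For intuition about what the IBAN checksum involves, here is a minimal sketch of the standard mod-97 check. Because an IBAN with a failing checksum is still returned (with lower confidence), a local check like this can also serve as a cheap second gate before a payment run. The function name is hypothetical, and the sketch checks only the checksum and overall length bounds, not country-specific lengths.

```python
def iban_checksum_ok(iban: str) -> bool:
    """Return True if the string passes the IBAN mod-97 checksum."""
    compact = iban.replace(" ", "").upper()
    # IBANs are ASCII alphanumeric and between 15 and 34 characters long.
    if not (compact.isascii() and compact.isalnum()):
        return False
    if not 15 <= len(compact) <= 34:
        return False
    # Move country code and check digits to the end, then map A=10 ... Z=35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_checksum_ok("DE89 3704 0044 0532 0130 00"))
print(iban_checksum_ok("DE89370400440532013001"))
```

The first call uses the IBAN from the invoice example; the second changes its last digit, which the checksum rejects.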

Extracting Tables and Repeated Data with ARRAY Fields

Invoices, purchase orders, and shipping manifests all contain line items — tables with repeated rows of the same structure. The ARRAY field type handles these without any changes to the extraction approach.

Request (curl)

curl -X POST \
  https://api.iterationlayer.com/document-extraction/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [
      {
        "type": "base64",
        "name": "purchase-order.pdf",
        "base64": "<PDF_BASE64>"
      }
    ],
    "schema": {
      "fields": [
        {
          "name": "po_number",
          "type": "TEXT",
          "description": "Purchase order number",
          "is_required": true
        },
        {
          "name": "order_date",
          "type": "DATE",
          "description": "Date the purchase order was issued"
        },
        {
          "name": "line_items",
          "type": "ARRAY",
          "description": "List of ordered items",
          "fields": [
            {
              "name": "description",
              "type": "TEXT",
              "description": "Item description"
            },
            {
              "name": "quantity",
              "type": "INTEGER",
              "description": "Quantity ordered"
            },
            {
              "name": "unit_price",
              "type": "CURRENCY_AMOUNT",
              "description": "Price per unit"
            }
          ]
        },
        {
          "name": "subtotal",
          "type": "CURRENCY_AMOUNT",
          "description": "Subtotal before tax"
        },
        {
          "name": "tax_amount",
          "type": "CURRENCY_AMOUNT",
          "description": "Tax amount"
        },
        {
          "name": "total",
          "type": "CURRENCY_AMOUNT",
          "description": "Total amount including tax"
        },
        {
          "name": "computed_total",
          "type": "CALCULATED",
          "description": "Sum of subtotal and tax for cross-check",
          "operation": "sum",
          "source_field_names": ["subtotal", "tax_amount"]
        }
      ]
    }
  }'

Response

{
  "success": true,
  "data": {
    "po_number": {
      "type": "TEXT",
      "value": "PO-2024-00412",
      "confidence": 0.97,
      "citations": ["PO-2024-00412"],
      "source": "purchase-order.pdf"
    },
    "order_date": {
      "type": "DATE",
      "value": "2024-10-22",
      "confidence": 0.95,
      "citations": ["22.10.2024"],
      "source": "purchase-order.pdf"
    },
    "line_items": {
      "type": "ARRAY",
      "value": [
        {
          "description": {
            "value": "Hydraulic Cylinder Model HC-200",
            "confidence": 0.94,
            "citations": ["Hydraulic Cylinder Model HC-200"]
          },
          "quantity": {
            "value": 12,
            "confidence": 0.96,
            "citations": ["12"]
          },
          "unit_price": {
            "value": 245.00,
            "confidence": 0.93,
            "citations": ["245,00"]
          }
        },
        {
          "description": {
            "value": "Pressure Gauge PG-50",
            "confidence": 0.95,
            "citations": ["Pressure Gauge PG-50"]
          },
          "quantity": {
            "value": 24,
            "confidence": 0.97,
            "citations": ["24"]
          },
          "unit_price": {
            "value": 38.50,
            "confidence": 0.94,
            "citations": ["38,50"]
          }
        }
      ],
      "confidence": 0.95,
      "citations": [],
      "source": "purchase-order.pdf"
    },
    "subtotal": {
      "type": "CURRENCY_AMOUNT",
      "value": 3864.00,
      "confidence": 0.94,
      "citations": ["Nettobetrag: EUR 3.864,00"],
      "source": "purchase-order.pdf"
    },
    "tax_amount": {
      "type": "CURRENCY_AMOUNT",
      "value": 734.16,
      "confidence": 0.93,
      "citations": ["MwSt. 19%: EUR 734,16"],
      "source": "purchase-order.pdf"
    },
    "total": {
      "type": "CURRENCY_AMOUNT",
      "value": 4598.16,
      "confidence": 0.95,
      "citations": ["Gesamtbetrag: EUR 4.598,16"],
      "source": "purchase-order.pdf"
    },
    "computed_total": {
      "type": "CALCULATED",
      "value": 4598.16,
      "confidence": 0.94,
      "citations": [],
      "source": "purchase-order.pdf"
    }
  }
}
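The response above already contains everything needed for two arithmetic cross-checks: the line items should add up to the subtotal, and computed_total should equal the total printed on the document. A minimal sketch, assuming the response shape shown above; the helper names are hypothetical, and Decimal is used so that cents compare exactly.

```python
from decimal import Decimal
from typing import Any, Dict


def money(value: Any) -> Decimal:
    """Convert a JSON number to a Decimal rounded to two places."""
    return Decimal(str(value)).quantize(Decimal("0.01"))


def cross_check(data: Dict[str, Any]) -> Dict[str, bool]:
    """Compare line-item arithmetic against the extracted totals."""
    line_sum = sum(
        (
            Decimal(item["quantity"]["value"]) * money(item["unit_price"]["value"])
            for item in data["line_items"]["value"]
        ),
        Decimal("0.00"),
    )
    return {
        "line_items_match_subtotal": line_sum == money(data["subtotal"]["value"]),
        "computed_total_matches_total": (
            money(data["computed_total"]["value"]) == money(data["total"]["value"])
        ),
    }


# Values copied from the purchase order response, trimmed to what is needed.
po_data = {
    "line_items": {"value": [
        {"quantity": {"value": 12}, "unit_price": {"value": 245.00}},
        {"quantity": {"value": 24}, "unit_price": {"value": 38.50}},
    ]},
    "subtotal": {"value": 3864.00},
    "total": {"value": 4598.16},
    "computed_total": {"value": 4598.16},
}
print(cross_check(po_data))
```

A failed check does not say which number is wrong, only that the document deserves a human look even if every individual confidence score is high.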

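Request (TypeScript)

import { readFileSync } from "node:fs";

// Read the scanned PDF from disk and base64-encode it for the request.
// `client` is the instance created in the first example.
const pdfBase64 = readFileSync("purchase-order.pdf").toString("base64");

const result = await client.extractDocument({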
  files: [
    {
      type: "base64",
      name: "purchase-order.pdf",
      base64: pdfBase64,
    },
  ],
  schema: {
    fields: [
      {
        name: "po_number",
        type: "TEXT",
        description: "Purchase order number",
        is_required: true,
      },
      {
        name: "order_date",
        type: "DATE",
        description: "Date the purchase order was issued",
      },
      {
        name: "line_items",
        type: "ARRAY",
        description: "List of ordered items",
        fields: [
          {
            name: "description",
            type: "TEXT",
            description: "Item description",
          },
          {
            name: "quantity",
            type: "INTEGER",
            description: "Quantity ordered",
          },
          {
            name: "unit_price",
            type: "CURRENCY_AMOUNT",
            description: "Price per unit",
          },
        ],
      },
      {
        name: "subtotal",
        type: "CURRENCY_AMOUNT",
        description: "Subtotal before tax",
      },
      {
        name: "tax_amount",
        type: "CURRENCY_AMOUNT",
        description: "Tax amount",
      },
      {
        name: "total",
        type: "CURRENCY_AMOUNT",
        description: "Total amount including tax",
      },
      {
        name: "computed_total",
        type: "CALCULATED",
        description:
          "Sum of subtotal and tax for cross-check",
        operation: "sum",
        source_field_names: ["subtotal", "tax_amount"],
      },
    ],
  },
});

Request (Python)

import base64

# Read the scanned PDF from disk and base64-encode it for the request.
# `client` is the instance created in the first example.
with open("purchase-order.pdf", "rb") as pdf_file:
    pdf_base64 = base64.b64encode(pdf_file.read()).decode("ascii")

result = client.extract_document(
    files=[
        {
            "type": "base64",
            "name": "purchase-order.pdf",
            "base64": pdf_base64,
        }
    ],
    schema={
        "fields": [
            {
                "name": "po_number",
                "type": "TEXT",
                "description": "Purchase order number",
                "is_required": True,
            },
            {
                "name": "order_date",
                "type": "DATE",
                "description": "Date the purchase order was issued",
            },
            {
                "name": "line_items",
                "type": "ARRAY",
                "description": "List of ordered items",
                "fields": [
                    {
                        "name": "description",
                        "type": "TEXT",
                        "description": "Item description",
                    },
                    {
                        "name": "quantity",
                        "type": "INTEGER",
                        "description": "Quantity ordered",
                    },
                    {
                        "name": "unit_price",
                        "type": "CURRENCY_AMOUNT",
                        "description": "Price per unit",
                    },
                ],
            },
            {
                "name": "subtotal",
                "type": "CURRENCY_AMOUNT",
                "description": "Subtotal before tax",
            },
            {
                "name": "tax_amount",
                "type": "CURRENCY_AMOUNT",
                "description": "Tax amount",
            },
            {
                "name": "total",
                "type": "CURRENCY_AMOUNT",
                "description": "Total amount including tax",
            },
            {
                "name": "computed_total",
                "type": "CALCULATED",
                "description": "Sum of subtotal and tax for cross-check",
                "operation": "sum",
                "source_field_names": ["subtotal", "tax_amount"],
            },
        ]
    },
)

Request

result, err := client.ExtractDocument(il.ExtractDocumentRequest{
    Files: []il.FileInput{
        il.NewFileFromBase64(
            "purchase-order.pdf", pdfBase64,
        ),
    },
    Schema: il.ExtractionSchema{
        "po_number": il.NewTextFieldConfig(
            "po_number", "Purchase order number",
        ),
        "order_date": il.NewDateFieldConfig(
            "order_date",
            "Date the purchase order was issued",
        ),
        "line_items": il.NewArrayFieldConfig(
            "line_items",
            "List of ordered items",
            []il.FieldConfig{
                il.NewTextFieldConfig(
                    "description", "Item description",
                ),
                il.NewIntegerFieldConfig(
                    "quantity", "Quantity ordered",
                ),
                il.NewCurrencyAmountFieldConfig(
                    "unit_price", "Price per unit",
                ),
            },
        ),
        "subtotal": il.NewCurrencyAmountFieldConfig(
            "subtotal", "Subtotal before tax",
        ),
        "tax_amount": il.NewCurrencyAmountFieldConfig(
            "tax_amount", "Tax amount",
        ),
        "total": il.NewCurrencyAmountFieldConfig(
            "total", "Total amount including tax",
        ),
        "computed_total": il.NewCalculatedFieldConfig(
            "computed_total",
            "Sum of subtotal and tax for cross-check",
            "sum",
            []string{"subtotal", "tax_amount"},
        ),
    },
})

The ARRAY field extracts variable-length tables without knowing the number of rows in advance. Each row gets its own set of confidence scores. The CALCULATED field computes subtotal + tax_amount and returns the result — you can compare it against the extracted total to catch discrepancies.

Cross-Checking with CALCULATED Fields

The computed_total in the example above is not just a convenience. It is a validation mechanism.

If the extracted total is 4,598.16 and the computed subtotal + tax_amount is also 4,598.16, the numbers are internally consistent. If they do not match, something went wrong — either the OCR misread a digit, or the document itself has an error.
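A minimal sketch of that comparison in Python, run client-side on the parsed response. The function name `totals_match` and the one-cent tolerance are assumptions made for illustration; the field names and values come from the purchase order example above.

```python
from decimal import Decimal


def totals_match(data: dict, tolerance: str = "0.01") -> bool:
    """Return True when the extracted total agrees with computed_total."""
    # Go through str() so binary float artifacts do not leak into the comparison.
    extracted = Decimal(str(data["total"]["value"]))
    computed = Decimal(str(data["computed_total"]["value"]))
    return abs(extracted - computed) <= Decimal(tolerance)


# Values taken from the purchase order response shown earlier.
data = {
    "total": {"value": 4598.16, "confidence": 0.95},
    "computed_total": {"value": 4598.16, "confidence": 0.94},
}
print(totals_match(data))  # True
```

A mismatch does not say which number is wrong, only that the document should go to review instead of straight into the database.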

Four operations are available: sum, subtract, multiply, and divide. The source fields must be numeric types (INTEGER, DECIMAL, or CURRENCY_AMOUNT). This is particularly valuable for financial documents where amounts should add up, quantities times unit prices should equal line totals, and discounts should subtract correctly.
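The same idea extends to line items. The sketch below is a client-side check in Python, not a CALCULATED field: it multiplies quantity by unit price for each row of the ARRAY value and compares the sum with the extracted subtotal. The helper name `line_items_total` is hypothetical; the row shape mirrors the response in the example above.

```python
from decimal import Decimal


def line_items_total(line_items: list) -> Decimal:
    """Sum quantity * unit_price over the rows of an ARRAY field value."""
    total = Decimal("0")
    for row in line_items:
        quantity = Decimal(str(row["quantity"]["value"]))
        unit_price = Decimal(str(row["unit_price"]["value"]))
        total += quantity * unit_price
    return total


# Rows from the purchase order example: 12 x 245.00 and 24 x 38.50.
rows = [
    {"quantity": {"value": 12}, "unit_price": {"value": 245.00}},
    {"quantity": {"value": 24}, "unit_price": {"value": 38.50}},
]
subtotal = Decimal("3864.00")
print(line_items_total(rows) == subtotal)  # True
```

For this document the arithmetic is 2,940.00 + 924.00 = 3,864.00, which matches the extracted Nettobetrag.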

Batch Processing: Digitizing an Archive

The real value of schema-based extraction shows up at scale. You have 2,000 scanned invoices in a folder. You need every one of them in your accounting system by the end of the quarter.

The API accepts up to 20 files per request, with a combined size up to 200 MB (50 MB per file). The parser extracts the same schema from each file and returns results individually, each with its own confidence scores.
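Checking those limits before sending a request avoids a wasted round trip. The following Python sketch is one way to do it; the function name `check_batch` is hypothetical, and it assumes 1 MB means 1,000,000 bytes, which is the stricter reading since the article does not say whether the limits are decimal or binary.

```python
MAX_FILES = 20
MAX_FILE_BYTES = 50 * 1_000_000    # assumption: decimal megabytes
MAX_TOTAL_BYTES = 200 * 1_000_000  # assumption: decimal megabytes


def check_batch(file_sizes: dict) -> list:
    """Return a list of limit violations for one batch; empty means OK.

    file_sizes maps a file name to its size in bytes.
    """
    problems = []
    if len(file_sizes) > MAX_FILES:
        problems.append(f"{len(file_sizes)} files exceeds the limit of {MAX_FILES}")
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} is larger than 50 MB")
    if sum(file_sizes.values()) > MAX_TOTAL_BYTES:
        problems.append("combined size is larger than 200 MB")
    return problems


print(check_batch({"invoice-001.pdf": 1_200_000, "invoice-002.pdf": 800_000}))  # []
```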

Curl | TypeScript | Python | Go

curl -X POST \
  https://api.iterationlayer.com/document-extraction/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [
      {
        "type": "url",
        "name": "invoice-001.pdf",
        "url": "https://example.com/scans/invoice-001.pdf"
      },
      {
        "type": "url",
        "name": "invoice-002.pdf",
        "url": "https://example.com/scans/invoice-002.pdf"
      },
      {
        "type": "url",
        "name": "invoice-003.pdf",
        "url": "https://example.com/scans/invoice-003.pdf"
      }
    ],
    "schema": {
      "fields": [
        {
          "name": "invoice_number",
          "type": "TEXT",
          "description": "Invoice reference number",
          "is_required": true
        },
        {
          "name": "invoice_date",
          "type": "DATE",
          "description": "Date the invoice was issued"
        },
        {
          "name": "total_amount",
          "type": "CURRENCY_AMOUNT",
          "description": "Total amount due"
        },
        {
          "name": "currency",
          "type": "CURRENCY_CODE",
          "description": "Currency of the total amount"
        }
      ]
    }
  }'

const result = await client.extractDocument({
  files: [
    {
      type: "url",
      name: "invoice-001.pdf",
      url: "https://example.com/scans/invoice-001.pdf",
    },
    {
      type: "url",
      name: "invoice-002.pdf",
      url: "https://example.com/scans/invoice-002.pdf",
    },
    {
      type: "url",
      name: "invoice-003.pdf",
      url: "https://example.com/scans/invoice-003.pdf",
    },
  ],
  schema: {
    fields: [
      {
        name: "invoice_number",
        type: "TEXT",
        description: "Invoice reference number",
        is_required: true,
      },
      {
        name: "invoice_date",
        type: "DATE",
        description: "Date the invoice was issued",
      },
      {
        name: "total_amount",
        type: "CURRENCY_AMOUNT",
        description: "Total amount due",
      },
      {
        name: "currency",
        type: "CURRENCY_CODE",
        description: "Currency of the total amount",
      },
    ],
  },
});

result = client.extract_document(
    files=[
        {
            "type": "url",
            "name": "invoice-001.pdf",
            "url": "https://example.com/scans/invoice-001.pdf",
        },
        {
            "type": "url",
            "name": "invoice-002.pdf",
            "url": "https://example.com/scans/invoice-002.pdf",
        },
        {
            "type": "url",
            "name": "invoice-003.pdf",
            "url": "https://example.com/scans/invoice-003.pdf",
        },
    ],
    schema={
        "fields": [
            {
                "name": "invoice_number",
                "type": "TEXT",
                "description": "Invoice reference number",
                "is_required": True,
            },
            {
                "name": "invoice_date",
                "type": "DATE",
                "description": "Date the invoice was issued",
            },
            {
                "name": "total_amount",
                "type": "CURRENCY_AMOUNT",
                "description": "Total amount due",
            },
            {
                "name": "currency",
                "type": "CURRENCY_CODE",
                "description": "Currency of the total amount",
            },
        ]
    },
)

result, err := client.ExtractDocument(il.ExtractDocumentRequest{
    Files: []il.FileInput{
        il.NewFileFromURL(
            "invoice-001.pdf",
            "https://example.com/scans/invoice-001.pdf",
        ),
        il.NewFileFromURL(
            "invoice-002.pdf",
            "https://example.com/scans/invoice-002.pdf",
        ),
        il.NewFileFromURL(
            "invoice-003.pdf",
            "https://example.com/scans/invoice-003.pdf",
        ),
    },
    Schema: il.ExtractionSchema{
        "invoice_number": il.NewTextFieldConfig(
            "invoice_number",
            "Invoice reference number",
        ),
        "invoice_date": il.NewDateFieldConfig(
            "invoice_date",
            "Date the invoice was issued",
        ),
        "total_amount": il.NewCurrencyAmountFieldConfig(
            "total_amount", "Total amount due",
        ),
        "currency": il.NewCurrencyCodeFieldConfig(
            "currency",
            "Currency of the total amount",
        ),
    },
})

For larger archives, chunk the files into batches of 20 and process them in parallel. A 2,000-document archive becomes 100 requests. With confidence-based routing, the high-confidence extractions go straight to your database, and only the ambiguous ones need human attention.
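The chunk-and-route flow can be sketched in Python as follows. Everything here is illustrative: `extract_batch` is a hypothetical callable that wraps one API request and is assumed to return one result per file, the worker count is arbitrary, and the 0.9 review threshold is an assumption you would tune for your own documents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

BATCH_SIZE = 20  # per-request file limit stated above


def chunk(items: list, size: int = BATCH_SIZE) -> list:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_archive(files: list, extract_batch: Callable, workers: int = 4) -> list:
    """Run extract_batch over every batch in parallel, preserving input order."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields batch results in the order the batches were submitted.
        for batch_result in pool.map(extract_batch, chunk(files)):
            results.extend(batch_result)
    return results


def needs_review(fields: dict, threshold: float = 0.9) -> bool:
    """True when any extracted field's confidence falls below the threshold."""
    return any(field["confidence"] < threshold for field in fields.values())


print(len(chunk(list(range(2000)))))  # 100
```

Results where `needs_review` returns False can be written to the database directly; the rest go to a human queue.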

Supported File Types

The API handles more than PDFs. You can send:

PDFs — digital, scanned, or hybrid
Word documents — DOCX files with embedded text and tables
Images — PNG, JPG, GIF, WEBP (these always get OCR)
Text files — MD, TXT, CSV, JSON

Images are the most common input for legacy document digitization. Someone photographs a paper form with their phone, uploads the JPG, and the API runs OCR and extraction in one step. No need to convert to PDF first.

File Inputs: URLs or Base64

Two ways to send files:

URL — point to a file hosted somewhere: { "type": "url", "name": "doc.pdf", "url": "https://..." }
Base64 — embed the file contents: { "type": "base64", "name": "doc.pdf", "base64": "..." }

Whichever way the file arrives, the parser handles 40+ formats: PDFs, Office documents (DOCX, PPTX, ODT, ODS, XLSX), EPUB, RTF, LaTeX, email (EML, MSG), Jupyter notebooks, images, and text/markup formats. Images get OCR automatically, with no separate step.
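As a rough illustration, the two input shapes can be built in Python with the standard library alone. The helper names `file_from_url` and `file_from_path` are hypothetical and not part of any SDK; the dictionary keys are the ones shown in the two examples above.

```python
import base64
from pathlib import Path


def file_from_url(name: str, url: str) -> dict:
    """Build a URL file input; the API fetches the file itself."""
    return {"type": "url", "name": name, "url": url}


def file_from_path(path: str) -> dict:
    """Read a local file and embed its contents as a base64 file input."""
    p = Path(path)
    encoded = base64.b64encode(p.read_bytes()).decode("ascii")
    return {"type": "base64", "name": p.name, "base64": encoded}


print(file_from_url("doc.pdf", "https://example.com/scans/doc.pdf")["type"])  # url
```

Base64 inflates the payload by roughly a third, so URL inputs are the lighter choice when the scans are already hosted somewhere the API can reach.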

Chaining Extraction with Document Generation

Extraction is the first step. What happens next depends on your workflow.

A common pattern in operations teams: extract data from incoming documents, validate it, then generate a standardized output document. A logistics company receives shipping manifests in different formats from different carriers. They extract the shipment details, normalize the data, and generate a unified report in their own format.

With composable APIs, this becomes two chained calls:

Extract shipment data from the carrier’s document — tracking numbers, weights, dimensions, delivery addresses — using the Document Extraction API.
Generate a standardized shipping report using the Document Generation API — same data, consistent format, ready for the warehouse team.

Same API key. Same credit pool. No glue code between the extraction step and the generation step. The structured JSON from the extraction response is the input for the document template.

Where Specialized OCR Tools Still Win

If you need to extract text from a single document format that never changes — the same form, the same layout, every time — a template-based parser with fixed coordinate extraction will be faster and possibly more accurate. Tools like AWS Textract with custom adapters or dedicated form-recognition services are optimized for this.

The schema-based approach wins when your documents vary. Different invoice layouts from different vendors. Different form designs across years of process changes. Different scan qualities from different offices. You define what data you want, and the parser adapts to wherever that data appears on the page.

The tradeoff is explicit: template-based tools are faster on uniform documents. Schema-based extraction is more flexible across diverse document types. If your archive contains documents from dozens of sources in various formats, the flexibility saves more time than the template approach’s speed advantage.

Handling Errors

Common error scenarios to handle in production:

401 Unauthorized: invalid or missing API key
400 Bad Request: malformed schema (e.g., an ARRAY field without its nested fields, an unknown field type, or more than 100 schema fields)
413 Payload Too Large: a file exceeds 50 MB, or the total payload exceeds 200 MB
422 Unprocessable Entity: the file could not be read (corrupted PDF, unsupported format)

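For production code, check both the HTTP status and the success field. A 200 response with success: true means the extraction completed; it does not mean every value is correct. Each field in the response has a value and a confidence score, so apply your confidence threshold before writing results downstream.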
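One possible shape for that check, sketched in Python. It takes an already-received status code and parsed JSON body, so it is independent of any HTTP client or SDK; `ExtractionError` and `unwrap_response` are hypothetical names, and the hint strings simply restate the list above.

```python
class ExtractionError(Exception):
    """Raised when the API call did not produce a usable extraction."""


# Short hints for the status codes listed above.
STATUS_HINTS = {
    400: "malformed schema",
    401: "invalid or missing API key",
    413: "file or payload too large",
    422: "file could not be read",
}


def unwrap_response(status_code: int, body: dict) -> dict:
    """Return the extracted fields, or raise ExtractionError."""
    if status_code != 200:
        hint = STATUS_HINTS.get(status_code, "unexpected status")
        raise ExtractionError(f"HTTP {status_code}: {hint}")
    if not body.get("success"):
        raise ExtractionError("HTTP 200 but success was not true")
    return body["data"]


fields = unwrap_response(200, {"success": True, "data": {"po_number": {"value": "PO-2024-00412", "confidence": 0.97}}})
print(fields["po_number"]["value"])  # PO-2024-00412
```

Raising a single exception type keeps the calling code simple: anything that returns normally is a completed extraction, ready for the per-field confidence check.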

Get Started

The full API reference, field type documentation, and SDK guides are in the Document Extraction docs. Install the TypeScript SDK (iterationlayer on npm), the Python SDK (iterationlayer on PyPI), or the Go SDK and start extracting.

Sign up for a free account — no credit card required. Define a schema, send a document, and check the confidence scores. The same schema you test with one invoice works on every invoice in your archive — no per-layout configuration needed.

If the extracted data feeds into reports, contracts, or other generated documents, the Document Generation API takes structured JSON and produces polished PDFs, DOCX, EPUB, or PPTX. Same auth, same credits, one pipeline from scanned paper to finished output.
