Document Extraction

Extract structured data from any document with a single API call. Send one or more files and a schema defining the fields you need, and receive typed, validated results with confidence scores and source citations.

Key Features

  • Multi-Format Support — Extract from PDF, DOCX, XLSX, images, HTML, Markdown, CSV, JSON, and plain text.
  • 17 Field Types — Text, numbers, dates, booleans, enums, emails, IBANs, countries, currencies, addresses, arrays, and calculated fields.
  • Structured Arrays — Extract repeating data like invoice line items with nested schemas.
  • Calculated Fields — Define arithmetic operations (sum, subtract, multiply, divide) computed from other extracted fields.
  • Confidence Scores — Every extracted value includes a confidence score between 0 and 1.
  • Source Citations — Verbatim quotes from the document that support each extracted value.
  • Schema Validation — Field schemas are validated before extraction, catching errors like circular dependencies or type mismatches early.

Overview

The Document Extraction API analyzes documents and extracts structured data based on a schema you define. You send one or more files (base64 or URL) and a schema with field definitions, and receive a JSON response with typed values, confidence scores, and citations.

Endpoint: POST /document-extraction/v1/extract

Supported formats: PDF, DOCX, XLSX/XLS, CSV, TXT, Markdown, JSON, HTML, PNG, JPEG, GIF, WebP

Limits:

  • Max file size: 50 MB per file

Request Format

cURL:

curl -X POST https://api.iterationlayer.com/document-extraction/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [
      {
        "type": "base64",
        "name": "invoice.pdf",
        "base64": "<base64-encoded-file>"
      }
    ],
    "schema": {
      "fields": [
        { "name": "invoice_number", "type": "TEXT", "description": "The invoice number" },
        { "name": "total_amount", "type": "CURRENCY_AMOUNT", "description": "Total invoice amount" }
      ]
    }
  }'

TypeScript:

import { IterationLayer } from "iterationlayer";
const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.extract({
  files: [
    {
      type: "base64",
      name: "invoice.pdf",
      base64: "<base64-encoded-file>",
    },
  ],
  schema: {
    fields: [
      { name: "invoice_number", type: "TEXT", description: "The invoice number" },
      { name: "total_amount", type: "CURRENCY_AMOUNT", description: "Total invoice amount" },
    ],
  },
});

Python:

from iterationlayer import IterationLayer
client = IterationLayer(api_key="YOUR_API_KEY")

result = client.extract(
    files=[
        {
            "type": "base64",
            "name": "invoice.pdf",
            "base64": "<base64-encoded-file>",
        }
    ],
    schema={
        "fields": [
            {"name": "invoice_number", "type": "TEXT", "description": "The invoice number"},
            {"name": "total_amount", "type": "CURRENCY_AMOUNT", "description": "Total invoice amount"},
        ]
    },
)

Go:

import il "github.com/iterationlayer/sdk-go"
client := il.NewClient("YOUR_API_KEY")

result, err := client.Extract(il.ExtractRequest{
    Files: []il.FileInput{il.NewFileFromBase64("invoice.pdf", "<base64-encoded-file>")},
    Schema: il.ExtractionSchema{
        "invoice_number": il.NewTextFieldConfig("invoice_number", "The invoice number"),
        "total_amount":   il.NewCurrencyAmountFieldConfig("total_amount", "Total invoice amount"),
    },
})

Top-Level Fields

Field        Type    Required  Description
files        array   Yes       List of file inputs to extract from (see File Input below)
schema       object  Yes       Extraction schema defining the fields to extract
webhook_url  string  No        HTTPS URL to receive results asynchronously. If provided, returns 201 immediately. See Webhooks.

Async Mode

Add a webhook_url parameter to process the request in the background. The API responds with status 201 immediately and delivers the result to your webhook URL when processing completes. See Webhooks for payload format and retry behavior.
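As a rough illustration of the two modes, the sketch below assembles the request body with and without webhook_url. The helper name build_extract_payload and the receiver URL are hypothetical, and the HTTPS check is a client-side convenience based on the table above, not a description of what the server does.

```python
from urllib.parse import urlparse


def build_extract_payload(files, fields, webhook_url=None):
    """Assemble the JSON body for POST /document-extraction/v1/extract."""
    payload = {"files": files, "schema": {"fields": fields}}
    if webhook_url is not None:
        parsed = urlparse(webhook_url)
        # The docs ask for an HTTPS URL, so reject anything else before sending.
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("webhook_url must be an absolute HTTPS URL")
        payload["webhook_url"] = webhook_url
    return payload


files = [{"type": "url", "name": "invoice.pdf", "url": "https://example.com/invoice.pdf"}]
fields = [{"name": "invoice_number", "type": "TEXT", "description": "The invoice number"}]

# Synchronous request: no webhook_url key at all.
sync_payload = build_extract_payload(files, fields)

# Asynchronous request: same body plus a (hypothetical) webhook receiver.
async_payload = build_extract_payload(
    files, fields, webhook_url="https://example.com/hooks/extraction"
)
```

Leaving the key out entirely, rather than sending null, keeps the synchronous request identical to the examples in Request Format.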

File Input

Provide each file as either base64 or a URL:

Field   Type    Required           Description
type    string  Yes                "base64" or "url"
name    string  Yes                Filename with extension (e.g., invoice.pdf)
base64  string  If type is base64  Base64-encoded file content
url     string  If type is url     Publicly accessible URL to fetch the file from

URL Example

{
  "files": [
    {
      "type": "url",
      "name": "invoice.pdf",
      "url": "https://example.com/invoice.pdf"
    }
  ],
  "schema": {
    "fields": [
      { "name": "invoice_number", "type": "TEXT", "description": "The invoice number" }
    ]
  }
}
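The two file input shapes can be produced with a couple of small helpers. This is a minimal sketch: file_from_bytes and file_from_url are hypothetical names, and the size check assumes the 50 MB limit means 50 * 1024 * 1024 bytes of raw (pre-encoding) content, which the documentation does not specify.

```python
import base64

# Assumption: "50 MB" is read as 50 MiB of raw file content.
MAX_FILE_BYTES = 50 * 1024 * 1024


def file_from_bytes(name, content):
    """Build a base64 file input from raw bytes."""
    if len(content) > MAX_FILE_BYTES:
        raise ValueError(f"{name} is larger than the 50 MB per-file limit")
    encoded = base64.b64encode(content).decode("ascii")
    return {"type": "base64", "name": name, "base64": encoded}


def file_from_url(name, url):
    """Build a URL file input; the URL must be publicly reachable."""
    return {"type": "url", "name": name, "url": url}


inline_file = file_from_bytes("notes.txt", b"hello")
remote_file = file_from_url("invoice.pdf", "https://example.com/invoice.pdf")
```

Both helpers keep the name field, since the table above marks it as required for either input type.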

Schema Definition

The schema object contains a fields array. Each field has these base properties:

Field        Type     Required  Description
name         string   Yes       Unique field identifier (used as key in the response)
type         string   Yes       One of the 17 supported field types
description  string   Yes       Natural language description of what to extract
is_required  boolean  No        If true, returns an error when the field cannot be extracted and has no default. Default: false

Each field type supports additional type-specific properties described below.
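Before sending a schema, it can help to check the base properties locally. The sketch below is a hypothetical pre-flight check (check_fields is not part of any SDK) that only covers the rules stated above: name, type, and description are present, and names are unique. The API performs its own validation regardless.

```python
REQUIRED_KEYS = ("name", "type", "description")


def check_fields(fields):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    seen = set()
    for index, field in enumerate(fields):
        for key in REQUIRED_KEYS:
            if not field.get(key):
                problems.append(f"field {index}: missing '{key}'")
        name = field.get("name")
        if name in seen:
            problems.append(f"field {index}: duplicate name '{name}'")
        elif name:
            seen.add(name)
    return problems


good = [{"name": "invoice_number", "type": "TEXT", "description": "The invoice number"}]
bad = [
    {"name": "total", "type": "CURRENCY_AMOUNT"},  # description missing
    {"name": "total", "type": "DECIMAL", "description": "Duplicate name"},
]
good_problems = check_fields(good)
bad_problems = check_fields(bad)
```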

Field Types

TEXT

Short single-line text value.

Property       Type     Required  Description
max_length     integer  No        Maximum character length (> 0)
default_value  string   No        Default value if not found

{ "name": "company_name", "type": "TEXT", "description": "Name of the company" }

TEXTAREA

Multi-line text value.

Property       Type     Required  Description
max_length     integer  No        Maximum character length (> 0)
default_value  string   No        Default value if not found

{ "name": "notes", "type": "TEXTAREA", "description": "Additional notes or comments" }

INTEGER

Whole number value.

Property       Type     Required  Description
min            integer  No        Minimum value (inclusive)
max            integer  No        Maximum value (inclusive)
unit           string   No        Unit label (e.g., "kg", "items")
default_value  integer  No        Default value if not found

{ "name": "quantity", "type": "INTEGER", "description": "Number of items ordered", "min": 1 }

DECIMAL

Floating-point number value.

Property        Type     Required  Description
min             float    No        Minimum value (inclusive)
max             float    No        Maximum value (inclusive)
decimal_points  integer  No        Number of decimal places to round to (>= 0)
unit            string   No        Unit label
default_value   float    No        Default value if not found

{ "name": "weight", "type": "DECIMAL", "description": "Package weight", "unit": "kg", "decimal_points": 2 }

DATE

Calendar date, extracted as an ISO 8601 string (YYYY-MM-DD).

Property            Type     Required  Description
allow_future_dates  boolean  No        Whether to allow dates in the future
allow_past_dates    boolean  No        Whether to allow dates in the past

{ "name": "invoice_date", "type": "DATE", "description": "Date the invoice was issued" }
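Because DATE values arrive as YYYY-MM-DD strings, they can be turned into date objects with the standard library. The following is a small sketch: parse_date_value and is_future are hypothetical helpers, and the sample field result is made up for illustration.

```python
from datetime import date


def parse_date_value(field_result):
    """Turn the string value of a DATE field result into a datetime.date."""
    return date.fromisoformat(field_result["value"])


def is_future(day, today):
    """True when `day` falls strictly after `today` (passed in for testability)."""
    return day > today


# Illustrative field result shaped like an entry in the response's data object.
invoice_date = parse_date_value({"type": "DATE", "value": "2024-03-15", "confidence": 0.97})
```

A check like is_future mirrors the intent of allow_future_dates on the client side; how the API itself decides what counts as "future" is not documented here.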

DATETIME

Date and time, extracted as an ISO 8601 datetime string.

Property            Type     Required  Description
allow_future_dates  boolean  No        Whether to allow dates in the future
allow_past_dates    boolean  No        Whether to allow dates in the past

{ "name": "timestamp", "type": "DATETIME", "description": "Transaction timestamp" }

TIME

Time value (e.g., "14:30:00"). No additional parameters.

{ "name": "delivery_time", "type": "TIME", "description": "Scheduled delivery time" }

ENUM

One or more values from a predefined list. Extracted as a string array.

Property       Type      Required  Description
values         string[]  Yes       Allowed options
min_selected   integer   No        Minimum number of selected values (>= 0)
max_selected   integer   No        Maximum number of selected values (> 0)
default_value  string[]  No        Default selected values

{
  "name": "payment_method",
  "type": "ENUM",
  "description": "How the invoice was paid",
  "values": ["bank_transfer", "credit_card", "cash", "paypal"],
  "max_selected": 1
}
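To make the selection constraints concrete, here is a sketch of how a client might re-check an ENUM result against its field definition. check_enum_selection is a hypothetical helper, and the defaults for omitted min_selected and max_selected (0 and no upper bound) are assumptions, not documented behavior.

```python
def check_enum_selection(field, selected):
    """Raise ValueError if `selected` breaks the field's ENUM constraints."""
    allowed = set(field["values"])
    unknown = [value for value in selected if value not in allowed]
    if unknown:
        raise ValueError(f"values not in the allowed list: {unknown}")
    # Assumption: omitted min_selected means 0, omitted max_selected means no cap.
    min_selected = field.get("min_selected", 0)
    max_selected = field.get("max_selected", len(allowed))
    if not min_selected <= len(selected) <= max_selected:
        raise ValueError(
            f"expected between {min_selected} and {max_selected} values, got {len(selected)}"
        )
    return selected


payment_method = {
    "name": "payment_method",
    "type": "ENUM",
    "description": "How the invoice was paid",
    "values": ["bank_transfer", "credit_card", "cash", "paypal"],
    "max_selected": 1,
}
checked = check_enum_selection(payment_method, ["credit_card"])
```

Note that even with max_selected set to 1, the value is still a one-element string array, as stated above.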

BOOLEAN

True or false value.

Property       Type     Required  Description
default_value  boolean  No        Default value if not found

{ "name": "is_paid", "type": "BOOLEAN", "description": "Whether the invoice has been paid" }

EMAIL

Email address string.

Property       Type    Required  Description
default_value  string  No        Default value if not found

{ "name": "contact_email", "type": "EMAIL", "description": "Contact email address" }

IBAN

International Bank Account Number. Validated against the pattern ^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$.

Property       Type    Required  Description
default_value  string  No        Default value if not found

{ "name": "bank_account", "type": "IBAN", "description": "Recipient IBAN" }
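The pattern above can be applied locally, for example to a default_value before it goes into a schema. This sketch uses Python's re module with the exact pattern from the documentation; looks_like_iban is a hypothetical name. The pattern checks structure only (country letters, two check digits, 11 to 30 alphanumerics) and does not verify the IBAN checksum.

```python
import re

# Pattern copied from the IBAN field type description.
IBAN_PATTERN = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")


def looks_like_iban(value):
    """True when `value` has the IBAN shape the API validates against."""
    return IBAN_PATTERN.fullmatch(value) is not None


compact = looks_like_iban("DE89370400440532013000")
spaced = looks_like_iban("DE89 3704 0044 0532 0130 00")  # spaces are not in the pattern
lowercase = looks_like_iban("de89370400440532013000")    # lowercase is not in the pattern
```

Whether the API normalizes spacing or case before applying the pattern is not stated, so the sketch matches the raw string as-is.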

COUNTRY

ISO 3166-1 alpha-2 country code (e.g., "DE", "US").

Property       Type    Required  Description
default_value  string  No        Must be a valid ISO 3166-1 alpha-2 code

{ "name": "origin_country", "type": "COUNTRY", "description": "Country of origin" }

CURRENCY_CODE

ISO 4217 currency code (e.g., "EUR", "USD").

Property       Type    Required  Description
default_value  string  No        Must be a valid ISO 4217 code

{ "name": "currency", "type": "CURRENCY_CODE", "description": "Invoice currency" }

CURRENCY_AMOUNT

Numeric monetary amount.

Property        Type     Required  Description
min             float    No        Minimum value (inclusive)
max             float    No        Maximum value (inclusive)
decimal_points  integer  No        Number of decimal places to round to (>= 0)
default_value   float    No        Default value if not found

{ "name": "total_amount", "type": "CURRENCY_AMOUNT", "description": "Total invoice amount", "decimal_points": 2 }

ADDRESS

Structured address object. Extracted as an object with street, city, region, postal_code, and country fields.

Property               Type      Required  Description
allowed_country_codes  string[]  No        Restrict to specific ISO 3166-1 alpha-2 country codes

Response value shape:

{
  "street": "123 Main St",
  "city": "Berlin",
  "region": "Berlin",
  "postal_code": "10115",
  "country": "DE"
}

Example field definition:

{ "name": "billing_address", "type": "ADDRESS", "description": "Billing address", "allowed_country_codes": ["DE", "AT", "CH"] }

ARRAY

A list of structured objects, each conforming to a nested schema. Use this for repeating data like line items.

Property     Type    Required  Description
item_schema  object  Yes       Nested schema with a fields array

The item_schema.fields array uses the same field configuration format as top-level fields.

{
  "name": "line_items",
  "type": "ARRAY",
  "description": "Invoice line items",
  "item_schema": {
    "fields": [
      { "name": "description", "type": "TEXT", "description": "Item description" },
      { "name": "quantity", "type": "INTEGER", "description": "Quantity ordered", "min": 1 },
      { "name": "unit_price", "type": "CURRENCY_AMOUNT", "description": "Price per unit", "decimal_points": 2 },
      { "name": "total", "type": "CURRENCY_AMOUNT", "description": "Line item total", "decimal_points": 2 }
    ]
  }
}
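Since item_schema.fields reuses the top-level field format, a schema can be walked recursively. The sketch below lists every field path in a schema; field_paths and the "[]." path notation are illustrative choices, not something the API defines.

```python
def field_paths(fields, prefix=""):
    """List every field in a schema, descending into ARRAY item schemas."""
    paths = []
    for field in fields:
        path = f"{prefix}{field['name']}"
        paths.append(path)
        if field["type"] == "ARRAY":
            # Nested fields use the same config format, so recurse with a new prefix.
            paths.extend(field_paths(field["item_schema"]["fields"], prefix=f"{path}[]."))
    return paths


schema_fields = [
    {"name": "invoice_number", "type": "TEXT", "description": "The invoice number"},
    {
        "name": "line_items",
        "type": "ARRAY",
        "description": "Invoice line items",
        "item_schema": {
            "fields": [
                {"name": "description", "type": "TEXT", "description": "Item description"},
                {"name": "quantity", "type": "INTEGER", "description": "Quantity ordered", "min": 1},
            ]
        },
    },
]
paths = field_paths(schema_fields)
```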

CALCULATED

A derived numeric value computed from other extracted fields. Not extracted from the document — calculated after all source fields are resolved.

Property            Type      Required  Description
operation           string    Yes       One of: "sum", "subtract", "multiply", "divide"
source_field_names  string[]  Yes       Names of fields to apply the operation to, in order
unit                string    No        Unit label

Source fields must be numeric types: INTEGER, DECIMAL, CURRENCY_AMOUNT, or another CALCULATED. Circular dependencies are detected and rejected at validation time.

{
  "name": "tax_amount",
  "type": "CALCULATED",
  "description": "Tax amount (total minus net)",
  "operation": "subtract",
  "source_field_names": ["total_amount", "net_amount"]
}
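The following sketch shows one way the resolution order and cycle detection described above could work. It is not the API's implementation: resolve_calculated is a hypothetical function, and applying each operation left to right across source_field_names is an assumption drawn from the "in order" wording.

```python
import operator
from functools import reduce

OPERATIONS = {
    "sum": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def resolve_calculated(values, calculated_fields):
    """Compute CALCULATED fields from extracted numeric values.

    values: field name -> number for already extracted fields.
    calculated_fields: list of CALCULATED field configs.
    """
    configs = {field["name"]: field for field in calculated_fields}
    resolved = dict(values)
    in_progress = set()

    def resolve(name):
        if name in resolved:
            return resolved[name]
        if name not in configs:
            raise KeyError(f"unknown source field '{name}'")
        if name in in_progress:
            raise ValueError(f"circular dependency involving '{name}'")
        in_progress.add(name)
        config = configs[name]
        operands = [resolve(source) for source in config["source_field_names"]]
        in_progress.discard(name)
        # Assumption: fold left to right, e.g. subtract means first minus the rest.
        resolved[name] = reduce(OPERATIONS[config["operation"]], operands)
        return resolved[name]

    for name in configs:
        resolve(name)
    return resolved


extracted = {"total_amount": 1190.0, "net_amount": 1000.0}
calculated = [
    {"name": "tax_rate", "type": "CALCULATED", "operation": "divide",
     "source_field_names": ["tax_amount", "net_amount"]},
    {"name": "tax_amount", "type": "CALCULATED", "operation": "subtract",
     "source_field_names": ["total_amount", "net_amount"]},
]
result = resolve_calculated(extracted, calculated)
```

Here tax_rate depends on tax_amount, another CALCULATED field, so the resolver computes tax_amount first even though it is listed second.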

Response Format

Success Response

{
  "success": true,
  "data": {
    "invoice_number": {
      "type": "TEXT",
      "value": "INV-2024-001",
      "confidence": 0.98,
      "citations": ["Invoice No: INV-2024-001"],
      "source": "invoice.pdf"
    },
    "total_amount": {
      "type": "CURRENCY_AMOUNT",
      "value": 1250.00,
      "confidence": 0.95,
      "citations": ["Total: €1,250.00"],
      "source": "invoice.pdf"
    }
  }
}

Each field in data contains:

Field       Type      Description
type        string    The field type from the schema
value       varies    Extracted value (type depends on field type; see below)
confidence  float     Confidence score between 0.0 and 1.0
citations   string[]  Verbatim quotes from the source document
source      string    Filename the value was extracted from

Value types by field type:

Field Type                                                                 Value Type
TEXT, TEXTAREA, EMAIL, IBAN, COUNTRY, CURRENCY_CODE, DATE, DATETIME, TIME  string
INTEGER, DECIMAL, CURRENCY_AMOUNT, CALCULATED                              number
BOOLEAN                                                                    boolean
ENUM                                                                       string[]
ADDRESS                                                                    object
ARRAY                                                                      object[]

Fields that could not be extracted and have no default_value are omitted from data (unless is_required is true, which causes an error). Fields resolved via default_value have a confidence of 1.0.
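A common way to consume this response is to route low-confidence values to manual review. The sketch below works on the success response shown earlier; split_by_confidence and the 0.96 threshold are illustrative choices, and due_date is a hypothetical field that was not part of the schema.

```python
def split_by_confidence(data, threshold=0.9):
    """Separate extracted values into accepted and needs-review buckets."""
    accepted, needs_review = {}, {}
    for name, result in data.items():
        bucket = accepted if result["confidence"] >= threshold else needs_review
        bucket[name] = result["value"]
    return accepted, needs_review


# Parsed JSON body, copied from the success response example.
response = {
    "success": True,
    "data": {
        "invoice_number": {
            "type": "TEXT", "value": "INV-2024-001", "confidence": 0.98,
            "citations": ["Invoice No: INV-2024-001"], "source": "invoice.pdf",
        },
        "total_amount": {
            "type": "CURRENCY_AMOUNT", "value": 1250.00, "confidence": 0.95,
            "citations": ["Total: €1,250.00"], "source": "invoice.pdf",
        },
    },
}

accepted, needs_review = split_by_confidence(response["data"], threshold=0.96)

# Fields that could not be extracted are absent from data, so use .get().
due_date = response["data"].get("due_date")
```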

Recipes

For complete, runnable examples see the Recipes page.

Error Responses

All errors return a JSON body with { "success": false, "error": "<message>" }.

Status  Description
400     Invalid request (missing files/schema, invalid base64, URL fetch failure, file size exceeded, invalid field config)
401     Missing or invalid API key
402     Insufficient credits or budget cap exceeded
422     Processing error (circular dependency in CALCULATED fields, required field not extractable, LLM parsing failure)
429     Rate limit exceeded
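One way to act on these statuses is to map each to a follow-up action. This is a sketch only: classify_response and the action labels are hypothetical, and treating 429 as the only retryable status is a judgment call rather than documented guidance.

```python
# Hypothetical action labels keyed by the documented status codes.
STATUS_ACTIONS = {
    400: "fix_request",
    401: "check_api_key",
    402: "check_credits",
    422: "inspect_schema_or_document",
    429: "retry_later",
}


def classify_response(status_code, body):
    """Return (action, error_message) for a parsed JSON response body."""
    if body.get("success") is True:
        return "ok", None
    return STATUS_ACTIONS.get(status_code, "unexpected"), body.get("error")


ok = classify_response(200, {"success": True, "data": {}})
limited = classify_response(429, {"success": False, "error": "Rate limit exceeded"})
unknown = classify_response(500, {"success": False, "error": "Internal error"})
```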