Document Extraction
Extract structured data from any document with a single API call. Send one or more files and a schema defining the fields you need, and receive typed, validated results with confidence scores and source citations.
Key Features
- Multi-Format Support — Extract from PDF, DOCX, XLSX, images, HTML, Markdown, CSV, JSON, and plain text.
- 17 Field Types — Text, numbers, dates, booleans, enums, emails, IBANs, countries, currencies, addresses, arrays, and calculated fields.
- Structured Arrays — Extract repeating data like invoice line items with nested schemas.
- Calculated Fields — Define arithmetic operations (sum, subtract, multiply, divide) computed from other extracted fields.
- Confidence Scores — Every extracted value includes a confidence score between 0 and 1.
- Source Citations — Verbatim quotes from the document that support each extracted value.
- Schema Validation — Field schemas are validated before extraction, catching errors like circular dependencies or type mismatches early.
Overview
The Document Extraction API analyzes documents and extracts structured data based on a schema you define. You send one or more files (base64 or URL) and a schema with field definitions, and receive a JSON response with typed values, confidence scores, and citations.
Endpoint: POST /document-extraction/v1/extract
Supported formats: PDF, DOCX, XLSX/XLS, CSV, TXT, Markdown, JSON, HTML, PNG, JPEG, GIF, WebP
Limits:
- Max file size: 50 MB per file
Request Format
curl -X POST https://api.iterationlayer.com/document-extraction/v1/extract \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"files": [
{
"type": "base64",
"name": "invoice.pdf",
"base64": "<base64-encoded-file>"
}
],
"schema": {
"fields": [
{ "name": "invoice_number", "type": "TEXT", "description": "The invoice number" },
{ "name": "total_amount", "type": "CURRENCY_AMOUNT", "description": "Total invoice amount" }
]
}
}'

import { IterationLayer } from "iterationlayer";
const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });
const result = await client.extract({
files: [
{
type: "base64",
name: "invoice.pdf",
base64: "<base64-encoded-file>",
},
],
schema: {
fields: [
{ name: "invoice_number", type: "TEXT", description: "The invoice number" },
{ name: "total_amount", type: "CURRENCY_AMOUNT", description: "Total invoice amount" },
],
},
});

from iterationlayer import IterationLayer
client = IterationLayer(api_key="YOUR_API_KEY")
result = client.extract(
files=[
{
"type": "base64",
"name": "invoice.pdf",
"base64": "<base64-encoded-file>",
}
],
schema={
"fields": [
{"name": "invoice_number", "type": "TEXT", "description": "The invoice number"},
{"name": "total_amount", "type": "CURRENCY_AMOUNT", "description": "Total invoice amount"},
]
},
)

import il "github.com/iterationlayer/sdk-go"
client := il.NewClient("YOUR_API_KEY")
result, err := client.Extract(il.ExtractRequest{
Files: []il.FileInput{il.NewFileFromBase64("invoice.pdf", "<base64-encoded-file>")},
Schema: il.ExtractionSchema{
"invoice_number": il.NewTextFieldConfig("invoice_number", "The invoice number"),
"total_amount": il.NewCurrencyAmountFieldConfig("total_amount", "Total invoice amount"),
},
})

Top-Level Fields
| Field | Type | Required | Description |
|---|---|---|---|
| files | array | Yes | List of file inputs to extract from (see File Input below) |
| schema | object | Yes | Extraction schema defining the fields to extract |
| webhook_url | string | No | HTTPS URL to receive results asynchronously. If provided, the API returns 201 immediately. See Webhooks. |
Async Mode
Add a webhook_url parameter to process the request in the background. The API responds with status 201 immediately and delivers the result to your webhook URL when processing completes. See Webhooks for payload format and retry behavior.
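As a rough illustration of how the top-level fields fit together without an SDK, the sketch below builds (but does not send) the request using only the Python standard library. The helper name build_extract_request and the webhook address are hypothetical; the endpoint URL and headers are taken from the curl example above.

```python
import json
import urllib.request
from urllib.parse import urlparse

API_URL = "https://api.iterationlayer.com/document-extraction/v1/extract"


def build_extract_request(api_key, files, schema, webhook_url=None):
    """Build (but do not send) a POST request for the extract endpoint."""
    payload = {"files": files, "schema": schema}
    if webhook_url is not None:
        # The docs ask for an HTTPS webhook URL, so fail early on anything else.
        if urlparse(webhook_url).scheme != "https":
            raise ValueError("webhook_url must use https")
        payload["webhook_url"] = webhook_url
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request(
    "YOUR_API_KEY",
    files=[{"type": "url", "name": "invoice.pdf", "url": "https://example.com/invoice.pdf"}],
    schema={"fields": [{"name": "invoice_number", "type": "TEXT", "description": "The invoice number"}]},
    webhook_url="https://example.com/hooks/extraction",  # hypothetical endpoint
)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```

Omitting webhook_url yields the synchronous form of the same request.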
File Input
Provide each file as either base64 or a URL:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | "base64" or "url" |
| name | string | Yes | Filename with extension (e.g., invoice.pdf) |
| base64 | string | If type is base64 | Base64-encoded file content |
| url | string | If type is url | Publicly accessible URL to fetch the file from |
URL Example
{
"files": [
{
"type": "url",
"name": "invoice.pdf",
"url": "https://example.com/invoice.pdf"
}
],
"schema": {
"fields": [
{ "name": "invoice_number", "type": "TEXT", "description": "The invoice number" }
]
}
}
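The two file input shapes can be produced with small helpers like the ones below. This is a sketch only: the function names are hypothetical, and it assumes the 50 MB limit applies to the raw file bytes and is counted in binary megabytes, which the documentation does not spell out.

```python
import base64

# Documented limit is 50 MB per file; treating it as 50 * 1024 * 1024 raw bytes is an assumption.
MAX_FILE_BYTES = 50 * 1024 * 1024


def file_input_from_bytes(name, content):
    """Return a base64 file input, rejecting content over the size limit."""
    if len(content) > MAX_FILE_BYTES:
        raise ValueError(f"{name} exceeds the 50 MB per-file limit")
    return {
        "type": "base64",
        "name": name,
        "base64": base64.b64encode(content).decode("ascii"),
    }


def file_input_from_url(name, url):
    """Return a URL file input; the URL has to be publicly reachable."""
    return {"type": "url", "name": name, "url": url}


files = [
    file_input_from_bytes("note.txt", b"Invoice No: INV-2024-001"),
    file_input_from_url("invoice.pdf", "https://example.com/invoice.pdf"),
]
```

For a file on disk, pathlib.Path(path).read_bytes() would supply the content argument.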
Schema Definition
The schema object contains a fields array. Each field has these base properties:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique field identifier (used as key in the response) |
| type | string | Yes | One of the 17 supported field types |
| description | string | Yes | Natural language description of what to extract |
| is_required | boolean | No | If true, returns an error when the field cannot be extracted and has no default. Default: false |
Each field type supports additional type-specific properties described below.
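A quick local check of the base properties can catch obvious mistakes before a request is sent. The sketch below is not the API's own validation; check_base_properties is a hypothetical helper that only looks at the four base properties listed above.

```python
REQUIRED_KEYS = ("name", "type", "description")


def check_base_properties(fields):
    """Return a list of problems found in the base properties of each field."""
    problems = []
    seen = set()
    for index, field in enumerate(fields):
        for key in REQUIRED_KEYS:
            if not isinstance(field.get(key), str) or not field[key]:
                problems.append(f"field {index}: missing or empty '{key}'")
        name = field.get("name")
        if name in seen:
            problems.append(f"field {index}: duplicate name '{name}'")
        seen.add(name)
        if "is_required" in field and not isinstance(field["is_required"], bool):
            problems.append(f"field {index}: is_required must be a boolean")
    return problems


fields = [
    {"name": "invoice_number", "type": "TEXT", "description": "The invoice number", "is_required": True},
    {"name": "invoice_number", "type": "TEXT", "description": "Duplicate on purpose"},
    {"name": "total_amount", "type": "CURRENCY_AMOUNT"},
]
problems = check_base_properties(fields)
```

Here the second field is flagged as a duplicate name and the third for its missing description.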
Field Types
TEXT
Short single-line text value.
| Property | Type | Required | Description |
|---|---|---|---|
| max_length | integer | No | Maximum character length (> 0) |
| default_value | string | No | Default value if not found |
{ "name": "company_name", "type": "TEXT", "description": "Name of the company" }
TEXTAREA
Multi-line text value.
| Property | Type | Required | Description |
|---|---|---|---|
| max_length | integer | No | Maximum character length (> 0) |
| default_value | string | No | Default value if not found |
{ "name": "notes", "type": "TEXTAREA", "description": "Additional notes or comments" }
INTEGER
Whole number value.
| Property | Type | Required | Description |
|---|---|---|---|
| min | integer | No | Minimum value (inclusive) |
| max | integer | No | Maximum value (inclusive) |
| unit | string | No | Unit label (e.g., "kg", "items") |
| default_value | integer | No | Default value if not found |
{ "name": "quantity", "type": "INTEGER", "description": "Number of items ordered", "min": 1 }
DECIMAL
Floating-point number value.
| Property | Type | Required | Description |
|---|---|---|---|
| min | float | No | Minimum value (inclusive) |
| max | float | No | Maximum value (inclusive) |
| decimal_points | integer | No | Number of decimal places to round to (>= 0) |
| unit | string | No | Unit label |
| default_value | float | No | Default value if not found |
{ "name": "weight", "type": "DECIMAL", "description": "Package weight", "unit": "kg", "decimal_points": 2 }
DATE
Calendar date, extracted as an ISO 8601 string (YYYY-MM-DD).
| Property | Type | Required | Description |
|---|---|---|---|
| allow_future_dates | boolean | No | Whether to allow dates in the future |
| allow_past_dates | boolean | No | Whether to allow dates in the past |
{ "name": "invoice_date", "type": "DATE", "description": "Date the invoice was issued" }
DATETIME
Date and time, extracted as an ISO 8601 datetime string.
| Property | Type | Required | Description |
|---|---|---|---|
| allow_future_dates | boolean | No | Whether to allow dates in the future |
| allow_past_dates | boolean | No | Whether to allow dates in the past |
{ "name": "timestamp", "type": "DATETIME", "description": "Transaction timestamp" }
TIME
Time value (e.g., "14:30:00"). No additional properties.
{ "name": "delivery_time", "type": "TIME", "description": "Scheduled delivery time" }
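Because DATE and TIME values arrive as ISO-style strings, they can be turned into native objects on the client. The sketch below uses only the standard library; is_allowed_date is a hypothetical helper that mirrors the allow_future_dates and allow_past_dates flags locally and is not a description of how the service applies them.

```python
from datetime import date, time


def parse_date(value):
    """Parse a DATE value (YYYY-MM-DD) into a datetime.date."""
    return date.fromisoformat(value)


def parse_time(value):
    """Parse a TIME value such as "14:30:00" into a datetime.time."""
    return time.fromisoformat(value)


def is_allowed_date(value, today, allow_future_dates=True, allow_past_dates=True):
    """Apply the future/past restrictions to an already parsed date."""
    if value > today and not allow_future_dates:
        return False
    if value < today and not allow_past_dates:
        return False
    return True


invoice_date = parse_date("2024-03-15")
delivery_time = parse_time("14:30:00")
```

DATETIME strings can usually be handled with datetime.fromisoformat as well, although older Python versions reject some ISO 8601 variants such as a trailing "Z".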
ENUM
One or more values from a predefined list. Extracted as a string array.
| Property | Type | Required | Description |
|---|---|---|---|
| values | string[] | Yes | Allowed options |
| min_selected | integer | No | Minimum number of selected values (>= 0) |
| max_selected | integer | No | Maximum number of selected values (> 0) |
| default_value | string[] | No | Default selected values |
{
"name": "payment_method",
"type": "ENUM",
"description": "How the invoice was paid",
"values": ["bank_transfer", "credit_card", "cash", "paypal"],
"max_selected": 1
}
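Since an ENUM value is always a string array, even when max_selected is 1, client code typically re-checks the selection before unwrapping it. The following is a minimal sketch of those constraints; check_enum_value is a hypothetical helper, not part of any SDK.

```python
def check_enum_value(selected, values, min_selected=0, max_selected=None):
    """Return True if `selected` satisfies the ENUM constraints."""
    if any(item not in values for item in selected):
        return False
    if len(selected) < min_selected:
        return False
    if max_selected is not None and len(selected) > max_selected:
        return False
    return True


payment_values = ["bank_transfer", "credit_card", "cash", "paypal"]
extracted = ["cash"]  # shape of an ENUM value with max_selected = 1

if check_enum_value(extracted, payment_values, max_selected=1):
    payment_method = extracted[0]  # unwrap the single selection
```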
BOOLEAN
True or false value.
| Property | Type | Required | Description |
|---|---|---|---|
| default_value | boolean | No | Default value if not found |
{ "name": "is_paid", "type": "BOOLEAN", "description": "Whether the invoice has been paid" }
EMAIL
Email address string.
| Property | Type | Required | Description |
|---|---|---|---|
| default_value | string | No | Default value if not found |
{ "name": "contact_email", "type": "EMAIL", "description": "Contact email address" }
IBAN
International Bank Account Number. Validated against the pattern ^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$.
| Property | Type | Required | Description |
|---|---|---|---|
| default_value | string | No | Default value if not found |
{ "name": "bank_account", "type": "IBAN", "description": "Recipient IBAN" }
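The documented pattern can be reproduced locally, for example to pre-check a default_value. Note that it is a format check only and does not verify the IBAN checksum. In the sketch below, stripping spaces and uppercasing before matching is an assumption about convenient client behavior, not something the API is documented to do.

```python
import re

# Pattern copied from the IBAN field description; re.ASCII keeps \d to 0-9.
IBAN_PATTERN = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$", re.ASCII)


def normalize_iban(raw):
    """Remove spaces and uppercase; printed IBANs are often grouped in blocks of four."""
    return raw.replace(" ", "").upper()


def matches_iban_pattern(raw):
    """Return True if the normalized string matches the documented pattern."""
    return IBAN_PATTERN.fullmatch(normalize_iban(raw)) is not None


looks_valid = matches_iban_pattern("DE89 3704 0044 0532 0130 00")
```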
COUNTRY
ISO 3166-1 alpha-2 country code (e.g., "DE", "US").
| Property | Type | Required | Description |
|---|---|---|---|
| default_value | string | No | Must be a valid ISO 3166-1 alpha-2 code |
{ "name": "origin_country", "type": "COUNTRY", "description": "Country of origin" }
CURRENCY_CODE
ISO 4217 currency code (e.g., "EUR", "USD").
| Property | Type | Required | Description |
|---|---|---|---|
| default_value | string | No | Must be a valid ISO 4217 code |
{ "name": "currency", "type": "CURRENCY_CODE", "description": "Invoice currency" }
CURRENCY_AMOUNT
Numeric monetary amount.
| Property | Type | Required | Description |
|---|---|---|---|
| min | float | No | Minimum value (inclusive) |
| max | float | No | Maximum value (inclusive) |
| decimal_points | integer | No | Number of decimal places to round to (>= 0) |
| default_value | float | No | Default value if not found |
{ "name": "total_amount", "type": "CURRENCY_AMOUNT", "description": "Total invoice amount", "decimal_points": 2 }
ADDRESS
Structured address object. Extracted as an object with street, city, region, postal_code, and country fields.
| Property | Type | Required | Description |
|---|---|---|---|
| allowed_country_codes | string[] | No | Restrict to specific ISO 3166-1 alpha-2 country codes |
Response value shape:
{
"street": "123 Main St",
"city": "Berlin",
"region": "Berlin",
"postal_code": "10115",
"country": "DE"
}
{ "name": "billing_address", "type": "ADDRESS", "description": "Billing address", "allowed_country_codes": ["DE", "AT", "CH"] }
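On the client, the address object maps naturally onto a small typed record. The sketch below assumes all five keys are always present in the value, which the documentation does not guarantee; Address and parse_address are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    region: str
    postal_code: str
    country: str  # ISO 3166-1 alpha-2 code, e.g. "DE"


def parse_address(value, allowed_country_codes=None):
    """Build an Address from an ADDRESS value and re-check the country restriction."""
    address = Address(**value)
    if allowed_country_codes is not None and address.country not in allowed_country_codes:
        raise ValueError(f"country {address.country!r} is not allowed")
    return address


billing = parse_address(
    {
        "street": "123 Main St",
        "city": "Berlin",
        "region": "Berlin",
        "postal_code": "10115",
        "country": "DE",
    },
    allowed_country_codes=["DE", "AT", "CH"],
)
```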
ARRAY
A list of structured objects, each conforming to a nested schema. Use this for repeating data like line items.
| Property | Type | Required | Description |
|---|---|---|---|
| item_schema | object | Yes | Nested schema with a fields array |
The item_schema.fields array uses the same field configuration format as top-level fields.
{
"name": "line_items",
"type": "ARRAY",
"description": "Invoice line items",
"item_schema": {
"fields": [
{ "name": "description", "type": "TEXT", "description": "Item description" },
{ "name": "quantity", "type": "INTEGER", "description": "Quantity ordered", "min": 1 },
{ "name": "unit_price", "type": "CURRENCY_AMOUNT", "description": "Price per unit", "decimal_points": 2 },
{ "name": "total", "type": "CURRENCY_AMOUNT", "description": "Line item total", "decimal_points": 2 }
]
}
}
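Because item_schema.fields reuses the top-level field format, a schema can be walked recursively. The sketch below lists every field path in a schema, which can help when reviewing large nested definitions; field_paths and the "[]." path notation are illustrative choices, not part of the API.

```python
def field_paths(fields, prefix=""):
    """Yield (path, type) for every field, descending into ARRAY item schemas."""
    for field in fields:
        path = prefix + field["name"]
        yield path, field["type"]
        if field["type"] == "ARRAY":
            yield from field_paths(field["item_schema"]["fields"], prefix=path + "[].")


schema_fields = [
    {"name": "invoice_number", "type": "TEXT", "description": "The invoice number"},
    {
        "name": "line_items",
        "type": "ARRAY",
        "description": "Invoice line items",
        "item_schema": {
            "fields": [
                {"name": "description", "type": "TEXT", "description": "Item description"},
                {"name": "quantity", "type": "INTEGER", "description": "Quantity ordered", "min": 1},
            ]
        },
    },
]
paths = list(field_paths(schema_fields))
```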
CALCULATED
A derived numeric value computed from other extracted fields. Not extracted from the document — calculated after all source fields are resolved.
| Property | Type | Required | Description |
|---|---|---|---|
| operation | string | Yes | One of: "sum", "subtract", "multiply", "divide" |
| source_field_names | string[] | Yes | Names of fields to apply the operation to, in order |
| unit | string | No | Unit label |
Source fields must be numeric types: INTEGER, DECIMAL, CURRENCY_AMOUNT, or another CALCULATED. Circular dependencies are detected and rejected at validation time.
{
"name": "tax_amount",
"type": "CALCULATED",
"description": "Tax amount (total minus net)",
"operation": "subtract",
"source_field_names": ["total_amount", "net_amount"]
}
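To make the ordering and dependency rules concrete, the sketch below resolves CALCULATED fields locally from already-extracted numbers and rejects circular dependencies. It illustrates the idea rather than the service's implementation; in particular, folding more than two operands left to right is an assumption, and resolve_calculated is a hypothetical helper.

```python
import operator
from functools import reduce

OPERATIONS = {
    "sum": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def resolve_calculated(values, calculated_fields):
    """Resolve CALCULATED fields from numeric values; raise ValueError on cycles."""
    configs = {field["name"]: field for field in calculated_fields}
    resolved = dict(values)
    in_progress = set()

    def resolve(name):
        if name in resolved:
            return resolved[name]
        if name not in configs:
            raise KeyError(f"unknown source field: {name}")
        if name in in_progress:
            raise ValueError(f"circular dependency involving {name}")
        in_progress.add(name)
        config = configs[name]
        operands = [resolve(source) for source in config["source_field_names"]]
        # Assumption: operands are folded left to right, so subtract over
        # [a, b, c] means (a - b) - c.
        resolved[name] = reduce(OPERATIONS[config["operation"]], operands)
        in_progress.discard(name)
        return resolved[name]

    for name in configs:
        resolve(name)
    return resolved


calculated = [
    # tax_rate is a hypothetical second field that depends on another CALCULATED field.
    {"name": "tax_rate", "operation": "divide", "source_field_names": ["tax_amount", "net_amount"]},
    {"name": "tax_amount", "operation": "subtract", "source_field_names": ["total_amount", "net_amount"]},
]
result = resolve_calculated({"total_amount": 1250.0, "net_amount": 1000.0}, calculated)
```

Listing tax_rate before tax_amount shows that declaration order does not matter here; dependencies are resolved on demand.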
Response Format
Success Response
{
"success": true,
"data": {
"invoice_number": {
"type": "TEXT",
"value": "INV-2024-001",
"confidence": 0.98,
"citations": ["Invoice No: INV-2024-001"],
"source": "invoice.pdf"
},
"total_amount": {
"type": "CURRENCY_AMOUNT",
"value": 1250.00,
"confidence": 0.95,
"citations": ["Total: €1,250.00"],
"source": "invoice.pdf"
}
}
}
Each field in data contains:
| Field | Type | Description |
|---|---|---|
| type | string | The field type from the schema |
| value | varies | Extracted value (type depends on field type; see below) |
| confidence | float | Confidence score between 0.0 and 1.0 |
| citations | string[] | Verbatim quotes from the source document |
| source | string | Filename the value was extracted from |
Value types by field type:
| Field Type | Value Type |
|---|---|
| TEXT, TEXTAREA, EMAIL, IBAN, COUNTRY, CURRENCY_CODE, DATE, DATETIME, TIME | string |
| INTEGER, DECIMAL, CURRENCY_AMOUNT, CALCULATED | number |
| BOOLEAN | boolean |
| ENUM | string[] |
| ADDRESS | object |
| ARRAY | object[] |
Fields that could not be extracted and have no default_value are omitted from data (unless is_required is true, which causes an error). Fields resolved via default_value have a confidence of 1.0.
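One common way to consume this response is to accept high-confidence values automatically and route the rest to manual review. The sketch below works on the sample response shown above; the threshold of 0.97 and the helper name split_by_confidence are arbitrary choices for illustration.

```python
def split_by_confidence(response, threshold=0.9):
    """Split extracted fields into accepted values and values needing review."""
    if not response.get("success"):
        raise ValueError(response.get("error", "extraction failed"))
    accepted, needs_review = {}, {}
    for name, field in response["data"].items():
        target = accepted if field["confidence"] >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review


response = {
    "success": True,
    "data": {
        "invoice_number": {
            "type": "TEXT",
            "value": "INV-2024-001",
            "confidence": 0.98,
            "citations": ["Invoice No: INV-2024-001"],
            "source": "invoice.pdf",
        },
        "total_amount": {
            "type": "CURRENCY_AMOUNT",
            "value": 1250.00,
            "confidence": 0.95,
            "citations": ["Total: €1,250.00"],
            "source": "invoice.pdf",
        },
    },
}

accepted, needs_review = split_by_confidence(response, threshold=0.97)
# Fields that could not be extracted are absent from data, so use .get() for optional ones.
notes = response["data"].get("notes")
```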
Recipes
For complete, runnable examples see the Recipes page.
- Automate Invoice Processing – Extract line items, totals, and vendor details from invoices into structured JSON.
- Parse Resumes and CVs – Pull contact info, work history, and skills from resumes into structured data.
- Extract Contract Clauses – Identify and extract specific clauses, dates, and parties from legal contracts.
- Parse Receipts and Expenses – Extract merchant, amount, date, and line items from receipt images and PDFs.
- Extract Product Catalog Data – Pull product names, prices, and specifications from catalog documents.
Error Responses
All errors return a JSON body with { "success": false, "error": "<message>" }.
| Status | Description |
|---|---|
| 400 | Invalid request (missing files/schema, invalid base64, URL fetch failure, file size exceeded, invalid field config) |
| 401 | Missing or invalid API key |
| 402 | Insufficient credits or budget cap exceeded |
| 422 | Processing error (circular dependency in CALCULATED fields, required field not extractable, LLM parsing failure) |
| 429 | Rate limit exceeded |
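Since every error shares the same body shape, client code can convert it into a single exception type. The sketch below is one possible approach; ExtractionError and raise_for_error are hypothetical names, and treating only 429 as retryable is an assumption (some 422 processing errors may also succeed on a later attempt).

```python
RETRYABLE_STATUSES = {429}  # assumption: only rate limiting is retried unchanged


class ExtractionError(Exception):
    """Error built from the documented { success: false, error: ... } body."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message
        self.retryable = status in RETRYABLE_STATUSES


def raise_for_error(status, body):
    """Raise ExtractionError for an error body; otherwise return the data object."""
    if body.get("success") is False:
        raise ExtractionError(status, body.get("error", "unknown error"))
    return body.get("data")


data = raise_for_error(200, {"success": True, "data": {}})
```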