PDFs Weren’t Designed for Data Extraction
PDF is a page-layout format. It knows where to put characters on a page. It does not know that those characters form an invoice number, or that the block of text in the middle is an address, or that the grid of numbers is a table of line items.
Every developer who’s tried to extract data from PDFs has gone through the same progression: start with a text extraction library, realize it loses all structure, try regex, realize it breaks across layouts, try a template-based parser, realize it only works for one document format.
The Document Extraction API takes a different approach. You define a schema — the fields you want — and the API handles the rest. OCR for scans, structure detection for tables, confidence scores for every extracted value.
The Three-Line Version
At its simplest:
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.extract({
  files: [{ type: "url", name: "report.pdf", url: "https://example.com/report.pdf" }],
  schema: {
    fields: [
      { name: "title", type: "TEXT", description: "Document title" },
      { name: "author", type: "TEXT", description: "Author name" },
      { name: "publish_date", type: "DATE", description: "Publication date" },
    ],
  },
});
Three fields, one API call, structured JSON back. But the API handles far more complex extractions too.
Schema-Based Extraction
The key idea: you describe what you want, not where it is. The schema is an array of field definitions, each with a name, a type, and a description that tells the parser what to look for.
The parser supports 17 field types:
- Text: TEXT, TEXTAREA
- Numbers: INTEGER, DECIMAL
- Dates and times: DATE, DATETIME, TIME
- Financial: CURRENCY_AMOUNT, CURRENCY_CODE, IBAN
- Geographic: ADDRESS, COUNTRY
- Selection: ENUM, BOOLEAN
- Contact: EMAIL
- Structured: ARRAY (tables and repeated data)
- Computed: CALCULATED (derive values from other fields)
Each type comes with specific validation. An IBAN field isn’t just a string — it’s validated as a proper IBAN. A CURRENCY_CODE returns an ISO 4217 code. An ADDRESS decomposes into street, city, region, postal code, and country.
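Since the schema is plain data, it can be checked locally before it is sent. The sketch below is an illustration, not part of the SDK: `findSchemaProblems` is a hypothetical helper, the type list is copied from the section above, and the 100-field limit is the one listed among the 400 errors later in this post. Whether that limit counts nested `item_schema` fields is not stated, so the sketch applies it per field list.

```typescript
// The 17 field types listed above.
const FIELD_TYPES = [
  "TEXT", "TEXTAREA", "INTEGER", "DECIMAL", "DATE", "DATETIME", "TIME",
  "CURRENCY_AMOUNT", "CURRENCY_CODE", "IBAN", "ADDRESS", "COUNTRY",
  "ENUM", "BOOLEAN", "EMAIL", "ARRAY", "CALCULATED",
] as const;

const KNOWN_TYPES = new Set<string>(FIELD_TYPES);

// Field shape as used in this post's examples. `type` stays a plain string
// so that unknown types can be reported at runtime.
interface FieldDefinition {
  name: string;
  type: string;
  description: string;
  is_required?: boolean;
  item_schema?: { fields: FieldDefinition[] };
}

// Hypothetical pre-flight check: returns a list of human-readable problems.
function findSchemaProblems(fields: FieldDefinition[]): string[] {
  const problems: string[] = [];
  if (fields.length > 100) {
    problems.push(`schema has ${fields.length} fields, limit is 100`);
  }
  for (const field of fields) {
    if (!KNOWN_TYPES.has(field.type)) {
      problems.push(`${field.name}: unknown field type ${field.type}`);
    }
    if (field.type === "ARRAY") {
      if (field.item_schema === undefined) {
        problems.push(`${field.name}: ARRAY field is missing item_schema`);
      } else {
        // Check the nested row schema with the same rules.
        problems.push(...findSchemaProblems(field.item_schema.fields));
      }
    }
  }
  return problems;
}
```

An empty result means the schema passed these local checks only; the API remains the authority on what it accepts.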
A Real-World Example
Parsing a company registration document:
const result = await client.extract({
  files: [{ type: "base64", name: "registration.pdf", base64: pdfBase64 }],
  schema: {
    fields: [
      {
        name: "company_name",
        type: "TEXT",
        description: "Registered company name",
        is_required: true,
      },
      {
        name: "registration_number",
        type: "TEXT",
        description: "Company registration or incorporation number",
        is_required: true,
      },
      {
        name: "registration_date",
        type: "DATE",
        description: "Date of incorporation",
      },
      {
        name: "registered_address",
        type: "ADDRESS",
        description: "Official registered address of the company",
      },
      {
        name: "directors",
        type: "ARRAY",
        description: "List of company directors",
        item_schema: {
          fields: [
            { name: "name", type: "TEXT", description: "Director full name" },
            { name: "appointed_date", type: "DATE", description: "Date appointed" },
            { name: "nationality", type: "COUNTRY", description: "Director nationality" },
          ],
        },
      },
      {
        name: "share_capital",
        type: "CURRENCY_AMOUNT",
        description: "Total authorized share capital",
      },
      {
        name: "currency",
        type: "CURRENCY_CODE",
        description: "Currency of the share capital",
      },
    ],
  },
});
The response comes back structured:
{
  "company_name": {
    "type": "TEXT",
    "value": "Northwind Trading GmbH",
    "confidence": 0.97
  },
  "registered_address": {
    "type": "ADDRESS",
    "value": {
      "street": "Friedrichstraße 43",
      "city": "Berlin",
      "region": "Berlin",
      "postal_code": "10117",
      "country": "DE"
    },
    "confidence": 0.94
  },
  "directors": {
    "type": "ARRAY",
    "value": [
      [
        { "value": "Maria Schmidt", "confidence": 0.96 },
        { "value": "2024-03-15", "confidence": 0.93 },
        { "value": "DE", "confidence": 0.95 }
      ],
      [
        { "value": "James Chen", "confidence": 0.94 },
        { "value": "2024-03-15", "confidence": 0.91 },
        { "value": "GB", "confidence": 0.93 }
      ]
    ],
    "confidence": 0.94
  }
}
The ADDRESS field decomposes automatically. The COUNTRY field returns an ISO 3166-1 alpha-2 code. The ARRAY handles variable-length director lists without any schema changes.
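In the sample response, each ARRAY row is a positional list of cells rather than a keyed object. A small helper can zip those cells with the field names from `item_schema`. This is a sketch under one assumption the sample suggests but does not state: cells arrive in the same order as the fields in `item_schema`. `rowsToObjects` is a hypothetical name, not an SDK function.

```typescript
// One cell of an ARRAY row, as it appears in the sample response.
interface ExtractedCell {
  value: unknown;
  confidence: number;
}

// Assumption: cell order matches the order of fields in item_schema.
function rowsToObjects(
  rows: ExtractedCell[][],
  fieldNames: string[],
): Record<string, ExtractedCell>[] {
  return rows.map((row) => {
    const record: Record<string, ExtractedCell> = {};
    fieldNames.forEach((name, index) => {
      const cell = row[index];
      // Skip missing cells instead of inventing values for them.
      if (cell !== undefined) {
        record[name] = cell;
      }
    });
    return record;
  });
}

// Rows copied from the "directors" value in the response above.
const directorRows: ExtractedCell[][] = [
  [
    { value: "Maria Schmidt", confidence: 0.96 },
    { value: "2024-03-15", confidence: 0.93 },
    { value: "DE", confidence: 0.95 },
  ],
  [
    { value: "James Chen", confidence: 0.94 },
    { value: "2024-03-15", confidence: 0.91 },
    { value: "GB", confidence: 0.93 },
  ],
];

const directors = rowsToObjects(directorRows, ["name", "appointed_date", "nationality"]);
```

Each entry in `directors` keeps both the value and the per-cell confidence, so downstream code can still judge individual cells.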
Confidence Scores on Everything
Every extracted field includes a confidence score between 0.0 and 1.0. In production, this score is what lets your code decide automatically which values to trust.
A document with clean digital text and clear formatting will score high — 0.90 and above. A scanned document with coffee stains and skewed alignment will score lower. Your code can route high-confidence results straight to your database and flag low-confidence results for human review.
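The routing described above can be sketched as a simple threshold split. The field shape follows the sample response. The function name `splitByConfidence` and the 0.95 threshold are illustrative assumptions; a suitable cut-off depends on your documents and on how costly a wrong value is.

```typescript
// Top-level field shape, following the sample response.
interface ExtractedField {
  type: string;
  value: unknown;
  confidence: number;
}

// Hypothetical helper: sorts field names into "accept" and "review" buckets.
function splitByConfidence(
  data: Record<string, ExtractedField>,
  threshold: number,
): { accepted: string[]; needsReview: string[] } {
  const accepted: string[] = [];
  const needsReview: string[] = [];
  for (const [name, field] of Object.entries(data)) {
    (field.confidence >= threshold ? accepted : needsReview).push(name);
  }
  return { accepted, needsReview };
}

// Confidence values taken from the sample response; values shortened.
const routed = splitByConfidence(
  {
    company_name: { type: "TEXT", value: "Northwind Trading GmbH", confidence: 0.97 },
    registered_address: { type: "ADDRESS", value: {}, confidence: 0.94 },
    directors: { type: "ARRAY", value: [], confidence: 0.94 },
  },
  0.95, // illustrative threshold, not a recommended value
);
```

With this threshold, `company_name` is accepted and the other two fields are queued for human review.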
File Inputs: URLs or Base64
Two ways to send files:
- URL: point to a file hosted somewhere.
  { type: "url", name: "doc.pdf", url: "https://..." }
- Base64: embed the file contents.
  { type: "base64", name: "doc.pdf", base64: "..." }
The parser handles PDFs, Word documents (DOCX), images (PNG, JPG, GIF, WEBP), and text files (MD, TXT, CSV, JSON). Images get OCR automatically — no separate step.
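A quick local check against the formats listed above can catch obviously unsupported files before an upload. The sketch below mirrors only the extensions named in this post. Other spellings such as `.jpeg` may well be accepted by the API, but that is not stated here, so they are left out. `isSupportedFileName` is a hypothetical helper.

```typescript
// The two file input shapes shown above.
type FileInput =
  | { type: "url"; name: string; url: string }
  | { type: "base64"; name: string; base64: string };

// Extensions named in this post; the API may accept more.
const SUPPORTED_EXTENSIONS = new Set([
  "pdf", "docx", "png", "jpg", "gif", "webp", "md", "txt", "csv", "json",
]);

// Hypothetical helper: checks the file name's extension, case-insensitively.
function isSupportedFileName(name: string): boolean {
  const dot = name.lastIndexOf(".");
  if (dot < 0 || dot === name.length - 1) {
    return false; // no extension at all
  }
  return SUPPORTED_EXTENSIONS.has(name.slice(dot + 1).toLowerCase());
}

const candidate: FileInput = {
  type: "url",
  name: "report.pdf",
  url: "https://example.com/report.pdf",
};
const candidateIsSupported = isSupportedFileName(candidate.name);
```

This checks the name only. A corrupted or mislabeled file still needs the API's own validation.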
Handling the Response
The API returns a JSON object with success: true and a data object containing each field you requested. Check the HTTP status code first, then process the data:
// The SDK throws on errors (invalid schema, file too large, auth failure).
// Wrap in try/catch for production code.
Common error scenarios to handle:
- 401 Unauthorized: invalid or missing API key
- 400 Bad Request: malformed schema (e.g., ARRAY field missing item_schema, unknown field type, more than 100 schema fields)
- 413 Payload Too Large: file exceeds 50 MB, or total payload exceeds 200 MB
- 422 Unprocessable Entity: the file couldn't be read (corrupted PDF, unsupported format)
For production code, check both the HTTP status and the success field. A 200 response with success: true means the extraction completed. Each field in the data object has a value and a confidence score that you can use to decide whether to accept or flag the result.
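One way to organize that check is to map each documented status to a next step and to treat a result as complete only when both the status and the `success` flag agree. The sketch works on a bare status number and response body, because how the SDK exposes the status on a thrown error is not covered here. The names `nextStepForStatus`, `extractionCompleted`, and the step labels are all hypothetical.

```typescript
type NextStep = "fix_api_key" | "fix_request" | "shrink_upload" | "replace_file" | "investigate";

// Maps the error statuses listed above to a suggested next step.
function nextStepForStatus(status: number): NextStep {
  switch (status) {
    case 401:
      return "fix_api_key"; // invalid or missing API key
    case 400:
      return "fix_request"; // malformed schema
    case 413:
      return "shrink_upload"; // file or total payload too large
    case 422:
      return "replace_file"; // corrupted or unsupported file
    default:
      return "investigate"; // anything not documented in this post
  }
}

// Response envelope as described above: a success flag plus a data object.
interface ExtractionResponse {
  success: boolean;
  data?: Record<string, { value: unknown; confidence: number }>;
}

// An extraction counts as complete only with HTTP 200 and success: true.
function extractionCompleted(status: number, body: ExtractionResponse): boolean {
  return status === 200 && body.success === true;
}
```

Keeping the mapping in one place makes it easy to log a consistent reason and to decide which failures are worth retrying.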
Computed Fields with CALCULATED
Some extractions need derived values. The CALCULATED field type computes a result from other extracted fields:
{
  name: "total_capital",
  type: "CALCULATED",
  description: "Sum of all share classes",
  operation: "sum",
  source_field_names: ["ordinary_shares", "preference_shares"],
}
Four operations are available: sum, subtract, multiply, and divide. The source fields must be numeric types (INTEGER, DECIMAL, or CURRENCY_AMOUNT). The parser extracts the source fields first, then computes the result. This is useful for cross-checking values in the document — if the computed total doesn’t match an extracted total, something is off.
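The cross-check mentioned above can be sketched as a comparison with a small tolerance, so that rounding in the document does not trigger a false alarm. The sketch assumes the relevant values are already available as plain numbers, which this post does not specify for CURRENCY_AMOUNT fields. The figures, the 0.005 tolerance, and the helper names are illustrative.

```typescript
// Local equivalent of the "sum" operation, for comparison purposes only.
function sumFields(values: number[]): number {
  return values.reduce((total, value) => total + value, 0);
}

// True when the computed and stated totals differ by at most `tolerance`.
function totalsAgree(computed: number, stated: number, tolerance = 0.005): boolean {
  return Math.abs(computed - stated) <= tolerance;
}

// Illustrative figures, not taken from a real document.
const ordinaryShares = 25000;
const preferenceShares = 5000;
const computedTotal = sumFields([ordinaryShares, preferenceShares]);

const matchesStatedTotal = totalsAgree(computedTotal, 30000);
const flagsMismatch = !totalsAgree(computedTotal, 31000);
```

A mismatch does not say which value is wrong. It only marks the document as one that deserves a second look.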
Batch Processing
Need to parse multiple files? Send up to 20 documents in a single request, with a combined size up to 200 MB (50 MB per file). The parser extracts the same schema from each file and returns results for every document individually, each with its own confidence scores.
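With more than 20 files, or more than 200 MB in total, the work has to be split across several requests. The sketch below groups files greedily under the limits stated above. It assumes 1 MB means 1024 × 1024 bytes and that the limits apply to the raw file size, neither of which this post confirms; base64 encoding inflates payloads by roughly a third, so leave headroom if you send files that way. `planBatches` is a hypothetical helper.

```typescript
interface PendingFile {
  name: string;
  sizeBytes: number;
}

// Limits from this post; the byte interpretation of "MB" is an assumption.
const MAX_FILES_PER_REQUEST = 20;
const MAX_FILE_BYTES = 50 * 1024 * 1024;
const MAX_REQUEST_BYTES = 200 * 1024 * 1024;

// Greedily groups files, in order, into batches that respect both limits.
function planBatches(files: PendingFile[]): PendingFile[][] {
  const batches: PendingFile[][] = [];
  let current: PendingFile[] = [];
  let currentBytes = 0;

  for (const file of files) {
    if (file.sizeBytes > MAX_FILE_BYTES) {
      throw new Error(`${file.name} exceeds the 50 MB per-file limit`);
    }
    const batchIsFull =
      current.length === MAX_FILES_PER_REQUEST ||
      currentBytes + file.sizeBytes > MAX_REQUEST_BYTES;
    if (batchIsFull && current.length > 0) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(file);
    currentBytes += file.sizeBytes;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

Each returned batch can then be sent as one request with the same schema. Greedy grouping keeps the original file order, which makes it easier to match results back to inputs.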
Get Started
The full API reference, field type documentation, and SDK guides are in the docs. Install the TypeScript SDK (@iterationlayer/parser) or Python SDK and start extracting structured data from your next PDF.
Sign up for a free account — no credit card required. Define a schema, send a document, see the result. The schema you write for your first test document is the same schema you use in production — no configuration changes needed as you scale from one document to thousands.