The Preprocessing Problem Nobody Talks About
Every LLM pipeline that processes documents hits the same wall. You get a PDF or image file, and the model needs clean text. What sits between the file and the model is always messier than expected.
The naive approach — pass the raw file to an OCR library, clean up the output with regex, hope for the best — produces unreliable results on anything but the simplest text documents. Tables collapse into jumbled strings. Multi-column layouts get interleaved. Headers and footers land in the middle of paragraphs. Scanned pages fail silently and return empty strings.
The result is a preprocessing layer that grows into a maintenance burden: custom handling for each file type, edge cases that break quarterly, no confidence that the output is actually correct.
This is the problem the Document to Markdown API solves. Send a PDF, image, or Office document — get clean markdown back. Tables preserved. Images described. OCR built in.
How It Works
The API runs one step of the same ingestion pipeline that powers the Document Extraction API. For text-based files (DOCX, XLSX, CSV, TXT, HTML), content is extracted and normalized to consistent markdown. For PDFs, pages are rendered and processed with a layout-aware OCR model that understands tables, columns, headers, and footers. For images, a vision model runs both OCR and a semantic description of the visual content.
The output is clean markdown. Headings are headings. Tables are markdown tables with proper columns. Lists keep their nesting. The structure of the original document is preserved in a format that LLMs, RAG pipelines, and humans can all read directly.
The API Call
```shell
curl -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "url",
      "name": "report.pdf",
      "url": "https://example.com/report.pdf"
    }
  }'
```

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.convertToMarkdown({
  file: {
    type: "url",
    name: "report.pdf",
    url: "https://example.com/report.pdf",
  },
});

console.log(result.markdown);
```

```python
from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

result = client.convert_to_markdown(
    file={
        "type": "url",
        "name": "report.pdf",
        "url": "https://example.com/report.pdf",
    }
)

print(result["markdown"])
```

```go
import il "github.com/iterationlayer/sdk-go"

client := il.NewClient("YOUR_API_KEY")

result, err := client.ConvertToMarkdown(il.ConvertRequest{
	File: il.NewFileFromURL(
		"report.pdf",
		"https://example.com/report.pdf",
	),
})
if err != nil {
	// handle error
}

fmt.Println(result.Markdown)
```

The response for a PDF or text document:
```json
{
  "success": true,
  "data": {
    "name": "report.pdf",
    "mime_type": "application/pdf",
    "markdown": "# Quarterly Report\n\n## Revenue\n\n| Segment | Q3 2024 | Q3 2023 | Change |\n|---------|---------|---------|--------|\n| Enterprise | $12.4M | $10.1M | +22.8% |\n| SMB | $4.2M | $3.8M | +10.5% |\n\n## Summary\n\nTotal revenue increased 19.4% year-over-year..."
  }
}
```

Tables come out as markdown tables. Headings come out as headings. The content is ready to embed, display, or pass to an LLM.
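Because the output is plain markdown, downstream code can consume it directly. As a quick illustration (this helper is not part of the API or any SDK — just a sketch), here is a minimal parser that turns the revenue table from the response above into row dicts:

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a simple markdown table into a list of row dicts."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip().startswith("|")]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]

table = """
| Segment | Q3 2024 | Q3 2023 | Change |
|---------|---------|---------|--------|
| Enterprise | $12.4M | $10.1M | +22.8% |
| SMB | $4.2M | $3.8M | +10.5% |
"""
print(parse_markdown_table(table)[0]["Q3 2024"])  # $12.4M
```

A real pipeline would use a proper markdown parser, but the point stands: the structure survives the round trip, so this kind of direct consumption is possible at all.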
The Image Description Field
Most document-to-text tools treat images as pixel grids to OCR. Text is extracted, everything else is discarded. This works for a scanned invoice but fails for a product photo, a chart, a diagram, or a screenshot.
When you send an image file to the Document to Markdown API, the response includes both the extracted text as markdown and a description field — a natural language description of what the image shows.
```json
{
  "success": true,
  "data": {
    "name": "architecture.png",
    "mime_type": "image/png",
    "markdown": "## System Overview\n\nLoad Balancer → App Server → Database",
    "description": "A system architecture diagram showing a three-tier setup with a load balancer distributing traffic to two application servers, each connected to a primary-replica PostgreSQL database pair."
  }
}
```
The description field only appears for image inputs. It is generated by a vision model that understands the image semantically — not just extracting text, but describing what’s depicted. This is useful for:
- Alt text generation — accessible descriptions from any image
- Image indexing — full-text search across visual content
- LLM context — giving a model visual understanding of a diagram or chart without multimodal input
- Content pipelines — turning mixed media (PDFs with text, images without text) into a uniform text representation
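The alt text case, for instance, is a one-liner once you have the response in hand. A minimal sketch, assuming you have already called the API and hold the `description` string (the `img_tag` helper below is illustrative, not part of any SDK):

```python
import html

def img_tag(src: str, description: str) -> str:
    """Build an accessible <img> tag using the API's description field as alt text."""
    return f'<img src="{html.escape(src)}" alt="{html.escape(description)}">'

description = (
    "A system architecture diagram showing a three-tier setup with a "
    "load balancer distributing traffic to two application servers."
)
print(img_tag("architecture.png", description))
```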
Supported Formats
One endpoint handles everything. No format-specific configuration, no conversion steps, no separate tool for images vs documents.
- PDF — text and scanned, with built-in OCR for scanned pages
- DOCX — Word documents with structure preserved
- XLSX — Excel spreadsheets rendered as markdown tables
- CSV — tabular data converted to markdown tables
- TXT — plain text passed through with minimal formatting
- HTML — content extracted, markup converted to markdown syntax
- PNG, JPEG, GIF, WebP — OCR for text, plus the description field for visual content
Files can be submitted as a URL or as base64-encoded content. Base64 is useful when the file isn’t publicly accessible — you read it from disk, encode it, and send it inline.
```shell
# Sending a base64-encoded file
# (note: "-w 0" is GNU base64; on macOS use "base64 -i scan.png" instead)
curl -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "base64",
      "name": "scan.png",
      "base64": "'$(base64 -w 0 scan.png)'"
    }
  }'
```

```typescript
import { IterationLayer } from "iterationlayer";
import { readFileSync } from "fs";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const fileBuffer = readFileSync("scan.png");
const result = await client.convertToMarkdown({
  file: {
    type: "base64",
    name: "scan.png",
    base64: fileBuffer.toString("base64"),
  },
});
```

```python
import base64
from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

with open("scan.png", "rb") as f:
    file_base64 = base64.b64encode(f.read()).decode()

result = client.convert_to_markdown(
    file={
        "type": "base64",
        "name": "scan.png",
        "base64": file_base64,
    }
)
```

```go
import (
	"encoding/base64"
	"os"

	il "github.com/iterationlayer/sdk-go"
)

client := il.NewClient("YOUR_API_KEY")

fileBytes, err := os.ReadFile("scan.png")
if err != nil {
	// handle error
}
fileBase64 := base64.StdEncoding.EncodeToString(fileBytes)

result, err := client.ConvertToMarkdown(il.ConvertRequest{
	File: il.FileInput{
		Type:   "base64",
		Name:   "scan.png",
		Base64: fileBase64,
	},
})
if err != nil {
	// handle error
}
```

Use Case: LLM Preprocessing
The most common use case is normalizing documents before passing them to an LLM. Consider a pipeline that classifies incoming documents — invoices, contracts, receipts — from a mix of senders. Some are PDFs with selectable text. Some are scanned images. Some are Word docs exported from accounting software.
Without preprocessing, your prompt needs to handle all these variations. With the Document to Markdown API, the LLM always receives the same format:
```shell
# Step 1: Convert document to markdown
MARKDOWN=$(curl -s -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer $ITERATION_LAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"file\": {
      \"type\": \"url\",
      \"name\": \"document.pdf\",
      \"url\": \"$DOCUMENT_URL\"
    }
  }" | jq -r '.data.markdown')

# Step 2: Classify with an LLM
# (note: quotes and newlines in $MARKDOWN will break this inline JSON;
# in a real script, build the payload with `jq -n --arg` instead)
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"claude-sonnet-4-20250514\",
    \"max_tokens\": 256,
    \"messages\": [{
      \"role\": \"user\",
      \"content\": \"Classify this document as one of: invoice, contract, receipt, report, other.\n\n$MARKDOWN\"
    }]
  }"
```

```typescript
import { IterationLayer } from "iterationlayer";
import Anthropic from "@anthropic-ai/sdk";

const il = new IterationLayer({ apiKey: process.env.ITERATION_LAYER_API_KEY });
const anthropic = new Anthropic();

const { markdown } = await il.convertToMarkdown({
  file: { type: "url", name: "document.pdf", url: documentUrl },
});

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 256,
  messages: [
    {
      role: "user",
      content: `Classify this document as one of: invoice, contract, receipt, report, other.\n\n${markdown}`,
    },
  ],
});
```

```python
import os

import anthropic
from iterationlayer import IterationLayer

il = IterationLayer(api_key=os.environ["ITERATION_LAYER_API_KEY"])
claude = anthropic.Anthropic()

result = il.convert_to_markdown(
    file={
        "type": "url",
        "name": "document.pdf",
        "url": document_url,
    }
)

response = claude.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=256,
    messages=[
        {
            "role": "user",
            "content": f"Classify this document as one of: invoice, contract, receipt, report, other.\n\n{result['markdown']}",
        }
    ],
)
```

```go
import (
	"context"
	"fmt"
	"os"

	"github.com/anthropics/anthropic-sdk-go"
	il "github.com/iterationlayer/sdk-go"
)

ilClient := il.NewClient(os.Getenv("ITERATION_LAYER_API_KEY"))
claude := anthropic.NewClient()

result, err := ilClient.ConvertToMarkdown(il.ConvertRequest{
	File: il.NewFileFromURL("document.pdf", documentURL),
})
if err != nil {
	// handle error
}

response, err := claude.Messages.New(context.Background(), anthropic.MessageNewParams{
	Model:     "claude-sonnet-4-20250514",
	MaxTokens: 256,
	Messages: []anthropic.MessageParam{
		anthropic.NewUserMessage(
			anthropic.NewTextBlock(
				fmt.Sprintf("Classify this document as one of: invoice, contract, receipt, report, other.\n\n%s", result.Markdown),
			),
		),
	},
})
if err != nil {
	// handle error
}
```

The markdown is clean regardless of whether the source was a scanned image or a text PDF. Your prompt stays simple. Your failure modes drop.
Use Case: RAG Ingestion
For RAG pipelines, document quality at ingestion time determines retrieval quality downstream. Garbage in, garbage out — if the markdown has garbled tables or missing headings, the chunks are bad, the embeddings are bad, and the retrieval is bad.
The Document to Markdown API provides clean markdown that chunks well. Tables are self-contained. Headings mark section boundaries. Paragraphs are complete. When you split this into chunks for embedding, each chunk is a coherent piece of content.
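Because headings reliably mark section boundaries in the output, even a naive heading-based chunker works. A rough sketch (a production chunker would also enforce token limits and handle overlap):

```python
import re

def chunk_by_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at heading boundaries, keeping each heading with its body."""
    chunks = re.split(r"\n(?=#{1,6} )", markdown.strip())
    return [c.strip() for c in chunks if c.strip()]

doc = (
    "# Quarterly Report\n\n"
    "## Revenue\n\n| Segment | Q3 2024 |\n|---|---|\n| Enterprise | $12.4M |\n\n"
    "## Summary\n\nTotal revenue increased 19.4% year-over-year."
)
for chunk in chunk_by_headings(doc):
    print(chunk.splitlines()[0])  # first line of each chunk is its heading
```

Each resulting chunk is a heading plus its complete body — including whole tables — which is exactly the coherence that embedding models need.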
Use Case: Knowledge Base Building
Teams that maintain internal knowledge bases — Notion, Confluence, or custom wikis — often need to ingest external documents. Contracts from partners, specifications from vendors, reports from agencies. The Document to Markdown API converts these to markdown that drops cleanly into any knowledge base system.
The image description field is particularly useful here. A diagram embedded in a specification becomes searchable text. A chart in a report gets a description that a text search can find.
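That searchability falls out naturally if you index the markdown and description fields together. A toy in-memory example (a real knowledge base would use its own search index; the record shape here is assumed from the response format shown earlier):

```python
# Toy full-text search over converted documents: markdown + description indexed together.
records = [
    {"name": "architecture.png",
     "markdown": "## System Overview\n\nLoad Balancer -> App Server -> Database",
     "description": "A system architecture diagram showing a load balancer and two app servers."},
    {"name": "report.pdf",
     "markdown": "# Quarterly Report\n\nTotal revenue increased 19.4% year-over-year.",
     "description": ""},
]

def search(query: str) -> list[str]:
    q = query.lower()
    return [r["name"] for r in records
            if q in r["markdown"].lower() or q in r["description"].lower()]

print(search("load balancer"))  # ['architecture.png']
```

A query for "load balancer" now finds the diagram, even though the match comes from the generated description rather than any text in the image itself.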
When to Use This vs. Document Extraction
Document Extraction builds on the same pipeline but adds a second step: it takes the markdown and uses an LLM to extract specific named fields according to a schema you define.
Use Document Extraction when:
- You know exactly which fields you need (invoice number, total, line items)
- You need typed values (dates as ISO dates, currencies as structured objects)
- You need confidence scores and source citations per field
- The output feeds into a database or structured workflow
Use Document to Markdown when:
- You need the full document content as text
- The output feeds into an LLM, a RAG pipeline, or a knowledge base
- You don’t know in advance which fields matter
- You need a uniform text representation of mixed file types
Getting Started
The API is available to all Iteration Layer accounts. See the documentation for the full request and response reference, or jump to the recipes for copy-paste examples covering invoices, contracts, and resumes.
The MCP server also exposes document-to-markdown as a tool — read how to use it from Claude Code or Cursor.