# Your RAG Pipeline Is Only as Good as Its Ingestion
Every team building retrieval-augmented generation hits the same bottleneck, and it is not the vector database, the embedding model, or the retrieval algorithm. It is the step before all of those: getting clean text out of the source documents.
You have a pile of PDFs, Word documents, scanned contracts, and spreadsheets. Your RAG pipeline needs them as text. What sits between the file and the embedding model is always messier than anyone budgets for. The naive approach — run an OCR library, strip the markup, split on newlines — produces output that looks plausible until you inspect it. Tables collapse into jumbled strings. Multi-column layouts get interleaved. Headers from page footers land in the middle of paragraphs. Scanned pages return empty strings with no error.
The result is bad chunks, bad embeddings, and bad retrieval. The LLM confidently answers questions using garbage context, and nobody notices until a customer complains.
This guide covers why markdown is the right intermediate format for LLM ingestion, what goes wrong with common extraction approaches, and how to build an ingestion pipeline that produces clean, chunk-friendly text from any document type.
## Why Markdown for LLM Ingestion
When you feed documents into a RAG pipeline, you need a text format that satisfies three constraints simultaneously:
- LLMs can read it natively. Every major model understands markdown. Headings, lists, tables, emphasis — all parsed correctly without additional prompting or formatting instructions.
- It preserves document structure. Unlike plain text, markdown retains hierarchy. An `## Section Title` is semantically different from a paragraph. A markdown table preserves row-column relationships. This structure matters for chunking and retrieval.
- It is lightweight and universal. No binary dependencies, no rendering engine, no schema to learn. Markdown is a string. You can store it, search it, diff it, and pipe it anywhere.
HTML satisfies the first two but fails the third — it is verbose, full of presentational noise, and requires parsing to extract content. JSON is structured but not human-readable and not something LLMs consume as naturally as prose. Plain text strips all structure, turning a table into an unreadable mess.
Markdown sits in the sweet spot. It is the format that loses the least information while remaining the simplest to work with downstream.
## What Goes Wrong with Common Extraction Approaches
Before covering the right way, it is worth understanding why the obvious approaches fail. Each one breaks in a specific, predictable way.
### Raw OCR Output
Libraries like Tesseract produce character-level output from rasterized pages. The output is a flat string with no structural awareness. A two-column academic paper becomes interleaved text — the left column’s first line followed by the right column’s first line, alternating down the page. Tables become sequences of cell values with no indication of which row or column they belong to. Headers and footers appear inline with body text.
For a simple single-column text document, raw OCR works fine. For anything with layout complexity — which describes most real-world documents — the output is unreliable.
### PDF Text Extraction Libraries
Tools like PyPDF2 or pdfplumber extract text by reading the PDF’s internal text objects. This avoids OCR entirely, which is faster but means scanned pages return nothing. Worse, PDFs store text as positioned character runs, not semantic paragraphs. A word that appears centered on the page might be stored as three separate text objects. Table cells are individual positioned strings with no table structure linking them.
These tools give you the characters on the page. They do not give you the document.
### LLM-Based Extraction Without Preprocessing
Some teams skip text extraction entirely and pass document images directly to a multimodal LLM. “Here is a screenshot of page 7, extract the table.” This works for individual pages but does not scale. Processing a 200-page document as 200 separate image prompts is slow, expensive, and loses cross-page context. A table that spans pages 14-15 gets split into two unrelated fragments.
Multimodal models also hallucinate content they cannot read clearly — low-contrast text, small fonts, watermarks. There is no confidence signal telling you whether the model is reading or guessing.
### Regular Expressions on Extracted Text
The worst approach is also the most common: extract text with whatever tool is handy, then write regex patterns to find the pieces you need. This works exactly once — for the exact document format you tested against. The next vendor sends a slightly different invoice layout, and your regex returns nonsense or nothing.
Regex-based extraction is the glue code that turns a one-hour task into a permanent maintenance burden.
## Layout Preservation: Why It Matters for Chunks
The quality of your RAG retrieval depends directly on the quality of your chunks. And the quality of your chunks depends on whether the extraction step preserved the document’s structure.
Consider a financial report with this layout:
```markdown
## Q3 Revenue by Segment

| Segment    | Q3 2025 | Q3 2024 | Change  |
|------------|---------|---------|---------|
| Enterprise | $12.4M  | $10.1M  | +22.8%  |
| SMB        | $4.2M   | $3.8M   | +10.5%  |
| Consumer   | $1.8M   | $2.1M   | -14.3%  |

Consumer revenue declined due to the sunset of the legacy product line.
```
If your extraction preserves this structure, the chunk is self-contained. A query like “What happened to consumer revenue in Q3?” retrieves a chunk with the table and the explanation together. The LLM can answer accurately.
Now consider what raw OCR produces from the same page:
```text
Q3 Revenue by Segment Segment Q3 2025 Q3 2024 Change Enterprise $12.4M $10.1M +22.8% SMB $4.2M $3.8M +10.5% Consumer $1.8M $2.1M -14.3% Consumer revenue declined due to the sunset of the legacy product line.
```
One long string. No table structure. No heading. When this gets chunked, the table data might end up in a different chunk than the explanation. The row-column relationships are lost entirely. The LLM has to guess which numbers belong to which segment.
Layout preservation is not a nice-to-have. It is the difference between a RAG pipeline that works and one that produces plausible-sounding wrong answers.
## Table Extraction: The Hardest Problem
Tables are where most extraction pipelines fall apart. A table is a two-dimensional data structure rendered as a visual grid. Converting it back to structured text requires understanding both the visual layout and the semantic relationships between cells.
Common failure modes:
- Merged cells — a header that spans three columns gets assigned to only the first column, or duplicated across all three
- Multi-line cells — a cell containing a paragraph wraps across visual lines, and the extractor treats each line as a separate row
- Implicit headers — some tables use the first row as a header without any visual distinction, and the extractor treats it as body data
- Nested tables — a table inside a table, common in government and legal documents, collapses into a single flat structure
- Spanning tables — a table that continues across page boundaries gets split into two unrelated fragments
The Document to Markdown API handles these cases by running a layout-aware model that understands tables as two-dimensional structures, not sequences of positioned text. The output is a markdown table with proper columns, rows, and alignment — ready to embed or pass to an LLM.
### Why Tables Matter More in RAG Than in Other Contexts
Tables are disproportionately important in RAG because they tend to hold the highest-density factual content in a document. A paragraph might discuss revenue trends in general terms. The table next to it has the exact numbers, broken down by segment, quarter, and region. When a user asks a specific question — “What was enterprise revenue in Q3?” — the answer is in the table, not the paragraph.
If your extraction pipeline garbles the table, two things go wrong. First, the chunk containing the table data produces a poor embedding because the text is incoherent. The vector does not capture the semantic meaning of “enterprise revenue was $12.4M.” Second, even if retrieval happens to return the chunk, the LLM receives jumbled text and either hallucinates a number or admits it cannot find the answer.
Markdown tables solve both problems. The row-column structure is preserved in plain text. The embedding model sees coherent, structured data. The LLM receives a table it can read, with column headers providing context for every cell value.
### Embedding Strategies for Table-Heavy Documents
When a document contains many tables, consider embedding the table and its surrounding context as a single unit, but also creating a separate “table-only” embedding. This gives you two retrieval paths: one that matches on the narrative context around the table, and one that matches directly on the tabular data. At retrieval time, merge results from both paths and deduplicate.
For very large tables — financial statements, product catalogs, audit logs — split at logical row boundaries (e.g., by category or time period) and repeat the column headers in each chunk. A chunk that reads `| Segment | Revenue | ... |\n| Enterprise | $12.4M | ... |` is self-contained. A chunk that reads `| $12.4M | ... |` without headers is not.
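The header-repeating split described above can be sketched in a few lines of plain Python. The helper name and the 50-row limit are illustrative choices, not part of any API:

```python
# Split an oversized markdown table into chunks that each repeat the
# header and separator rows, so every chunk is self-contained.
def split_markdown_table(table: str, max_rows: int = 50) -> list[str]:
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header, separator, body = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(body), max_rows):
        rows = body[i:i + max_rows]
        chunks.append("\n".join([header, separator] + rows))
    return chunks

# 120 body rows with a 50-row limit yields three chunks, each carrying
# the column headers with it.
table = "| Segment | Revenue |\n|---|---|\n" + "\n".join(
    f"| Item {i} | ${i}M |" for i in range(120)
)
chunks = split_markdown_table(table, max_rows=50)
```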
## Handling Scanned Documents
Scanned documents — PDFs created from a scanner or camera, with no selectable text layer — require OCR before any text extraction can happen. But OCR alone is not enough. The OCR gives you characters. You still need a model that understands page layout to turn those characters into structured text.
The Document to Markdown API runs OCR and layout analysis as a single pipeline. For text-based files (DOCX, XLSX, CSV, TXT, HTML), content is extracted and normalized directly. For PDFs, pages are rendered and processed with a layout-aware model that understands tables, columns, headers, and footers. For images, a vision model runs both OCR and a semantic description of the visual content.
You do not need to detect whether a document is scanned or text-based. The API handles both. Send a PDF with some scanned pages and some text pages — the output is the same clean markdown regardless.
## The API Call
One endpoint, any document type. No format-specific configuration.
```bash
curl -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "url",
      "name": "report.pdf",
      "url": "https://example.com/report.pdf"
    }
  }'
```

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.convertToMarkdown({
  file: {
    type: "url",
    name: "report.pdf",
    url: "https://example.com/report.pdf",
  },
});

console.log(result.markdown);
```

```python
from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

result = client.convert_to_markdown(
    file={
        "type": "url",
        "name": "report.pdf",
        "url": "https://example.com/report.pdf",
    }
)

print(result["markdown"])
```

```go
package main

import (
	"fmt"

	il "github.com/iterationlayer/sdk-go"
)

func main() {
	client := il.NewClient("YOUR_API_KEY")

	result, err := client.ConvertToMarkdown(il.ConvertRequest{
		File: il.NewFileFromURL(
			"report.pdf",
			"https://example.com/report.pdf",
		),
	})
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Markdown)
}
```

The response:

```json
{
  "success": true,
  "data": {
    "name": "report.pdf",
    "mime_type": "application/pdf",
    "markdown": "# Quarterly Report\n\n## Revenue\n\n| Segment | Q3 2025 | Q3 2024 | Change |\n|---------|---------|---------|--------|\n| Enterprise | $12.4M | $10.1M | +22.8% |\n| SMB | $4.2M | $3.8M | +10.5% |\n\n## Summary\n\nTotal revenue increased 19.4% year-over-year..."
  }
}
```

Tables come out as markdown tables. Headings come out as headings. The content is ready to embed, chunk, or pass to an LLM.
## Image Descriptions for Visual Content
Most document-to-text tools treat images as pixel grids to OCR. Text is extracted, everything else is discarded. This works for a scanned invoice but fails for a product photo, a chart, a diagram, or an architecture screenshot.
When you send an image file to the Document to Markdown API, the response includes both the extracted text as markdown and a description field — a natural language description of what the image shows.
```json
{
  "success": true,
  "data": {
    "name": "architecture.png",
    "mime_type": "image/png",
    "markdown": "## System Overview\n\nLoad Balancer → App Server → Database",
    "description": "A system architecture diagram showing a three-tier setup with a load balancer distributing traffic to two application servers, each connected to a primary-replica PostgreSQL database pair."
  }
}
```
The description field only appears for image inputs. It is generated by a vision model that understands the image semantically — not just extracting text, but describing what is depicted. For a RAG pipeline, this means diagrams and charts become searchable, embeddable text. A query about “database architecture” can retrieve a chunk that includes the description of an architecture diagram, even though the diagram itself contains no text mentioning databases.
## Chunking Strategies for Markdown
Once you have clean markdown, the next step is splitting it into chunks for embedding. The structure that markdown preserves makes this significantly easier than chunking raw text.
### Heading-Based Chunking
The simplest and most effective strategy. Split the document at heading boundaries. Each chunk starts with a heading and contains everything until the next heading of the same or higher level.
```markdown
## Revenue            ← chunk boundary
...paragraph...
...table...

## Expenses           ← chunk boundary
...paragraph...

### Operating Costs   ← chunk boundary (sub-section)
...paragraph...
```
This works because headings in a well-structured document correspond to topic boundaries. Each chunk is about one thing. Retrieval queries match against coherent topics, not arbitrary text windows.
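As a minimal sketch of this strategy, the splitter below cuts at every markdown heading line. It is deliberately simplified — it splits on any heading level rather than implementing the "same or higher level" rule described above, and the function name is illustrative:

```python
import re

# Split markdown at heading lines, keeping each heading together with
# the text that follows it (simplification: splits at EVERY heading,
# regardless of level).
def chunk_by_headings(markdown: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "## Revenue\ntext a\n\n## Expenses\ntext b\n### Operating Costs\ntext c"
chunks = chunk_by_headings(doc)
# Three chunks, each beginning with its heading.
```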
### Table-Aware Chunking
Tables should never be split across chunks. A row without its header is meaningless. A header without its rows is useless. When you encounter a table in the markdown, keep it as a single unit within its chunk.
If a table is too large for your chunk size limit, include the header row in every chunk that contains table rows. This way, each chunk is self-contained — the column names travel with the data.
### Metadata-Enriched Chunks
Each chunk benefits from metadata that helps the retrieval step. At minimum, include:
- Source document name — which file this chunk came from
- Section path — the heading hierarchy (e.g., “Revenue > Q3 Results > Regional Breakdown”)
- Chunk index — position within the document, for ordering results
- Document type — if you are ingesting multiple document types, tag each chunk
This metadata does not need to be part of the embedded text. Store it alongside the embedding vector and use it for filtering or re-ranking at retrieval time.
Some metadata is worth prepending to the chunk text before embedding. The section path — “Q3 Financial Report > Revenue > Regional Breakdown” — adds semantic signal that helps the embedding model place the chunk in the right neighborhood of the vector space. A chunk that starts with “Regional Breakdown” alone is less specific than one that starts with “Q3 Financial Report > Revenue > Regional Breakdown.” The difference shows up at retrieval time when multiple documents discuss revenue and the query needs to match the right one.
Other metadata — document name, chunk index, ingestion timestamp — is purely structural. Embedding it adds noise. Store it as filterable fields in your vector database. When a user asks “What does the employee handbook say about PTO?” you filter by document type before running the vector search, narrowing the candidate set and improving precision.
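The split between embedded context and filterable metadata can be sketched as a small helper. The function name and field choices here are illustrative; `embed()` would be whatever embedding client you use:

```python
# Prepend the section path to the text that gets embedded, and keep
# purely structural metadata out of the embedded string — it is stored
# alongside the vector as filterable fields instead.
def prepare_chunk(chunk_text: str, section_path: str, source: str, index: int):
    embed_text = f"{section_path}\n\n{chunk_text}"
    metadata = {
        "source_document": source,   # filterable, never embedded
        "section_path": section_path,
        "chunk_index": index,
    }
    return embed_text, metadata

embed_text, metadata = prepare_chunk(
    "| Region | Revenue |\n|---|---|\n| EMEA | $3.1M |",
    "Q3 Financial Report > Revenue > Regional Breakdown",
    "report-q3.pdf",
    4,
)
```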
### Overlap Strategy
When splitting at heading boundaries, some context can be lost at chunk edges. A common mitigation is to include the last paragraph of the previous chunk as a prefix to the current chunk. This gives the embedding model cross-boundary context without duplicating entire sections.
For markdown, a better approach is to include the parent heading chain as a prefix. If a chunk starts at `### Operating Costs`, prefix it with `## Expenses > ### Operating Costs`. This gives the embedding model hierarchical context without duplicating content.
### Chunking Strategies by Document Type
Different document types have different internal structures, and a single chunking strategy does not work well across all of them. The right approach depends on what the document looks like after conversion to markdown.
**Long-form reports and whitepapers.** These have clear heading hierarchies. Heading-based chunking works well. Target 500-1,000 tokens per chunk. Respect heading levels — a `##` section with three `###` subsections should produce three or four chunks (one per subsection, possibly one for the `##` intro paragraph), not one giant chunk for the entire section.

**Legal contracts.** Contracts are structured by numbered clauses, not markdown headings. After conversion, clause numbers often appear as headings or bold text. Chunk at the clause level. Each clause is typically self-contained — it defines one obligation, one condition, or one exception. Keep the clause number and any parent section reference in the chunk metadata so the LLM can cite specific clauses in its response.

**Financial statements and spreadsheets.** These are almost entirely tables. After conversion from XLSX or from a PDF containing financial data, the markdown is a sequence of tables with minimal surrounding text. Chunk each table as a unit with its title. If a table exceeds your token limit, split by logical row groups (by quarter, by department, by account category) and repeat column headers.

**Receipts, invoices, and short-form documents.** A single-page document that converts to fewer than 500 tokens should be a single chunk. Do not split it. The overhead of multiple chunks — multiple embeddings, multiple retrieval candidates — is not worth it when the entire document fits in one.

**Slide decks (converted from PPTX or images).** Each slide is a self-contained unit. Chunk per slide. The markdown for a slide is typically a heading, a few bullet points, and possibly an image description. Keep the slide number in the metadata for ordering.

**Mixed-format documents.** Some documents — employee handbooks, product manuals, government filings — contain prose sections, tables, images, and lists all interleaved. Use heading-based chunking as the primary strategy, but apply the table-aware rule: never split a table across chunks. If a heading section contains both prose and a table, and the combined text exceeds your token limit, split the prose from the table but keep the heading with both chunks.
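The per-type routing above amounts to a small dispatch table. This is a sketch under stated assumptions — the strategy names and the `document_type` tags are placeholders for the approaches described in this section, not values from any API:

```python
# Route each converted document to a chunking strategy based on its
# document_type tag. Short documents are always a single chunk,
# whatever their type.
STRATEGY_BY_TYPE = {
    "report": "heading_based",
    "contract": "per_clause",
    "spreadsheet": "per_table",
    "receipt": "single_chunk",
    "slide_deck": "per_slide",
}

def pick_strategy(document_type: str, token_count: int) -> str:
    if token_count < 500:
        return "single_chunk"  # splitting is not worth the overhead
    return STRATEGY_BY_TYPE.get(document_type, "heading_based")
```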
## Building an Ingestion Pipeline
A complete RAG ingestion pipeline takes a collection of documents and produces embedded, indexed chunks ready for retrieval. Here is the workflow, from raw files to searchable knowledge base.
### Step 1: Convert Documents to Markdown
The first step normalizes every document type into clean markdown.
```bash
# Convert a batch of documents
for FILE_URL in "${DOCUMENT_URLS[@]}"; do
  FILENAME=$(basename "$FILE_URL")
  curl -s -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
    -H "Authorization: Bearer $ITERATION_LAYER_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
      \"file\": {
        \"type\": \"url\",
        \"name\": \"$FILENAME\",
        \"url\": \"$FILE_URL\"
      }
    }" >> converted_documents.jsonl
done
```

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({
  apiKey: process.env.ITERATION_LAYER_API_KEY,
});

const documentUrls = [
  "https://example.com/report-q3.pdf",
  "https://example.com/handbook.docx",
  "https://example.com/pricing.xlsx",
];

const markdownDocuments = await Promise.all(
  documentUrls.map((url) =>
    client.convertToMarkdown({
      file: {
        type: "url",
        name: url.split("/").pop(),
        url,
      },
    })
  )
);
```

```python
import os

from iterationlayer import IterationLayer

client = IterationLayer(api_key=os.environ["ITERATION_LAYER_API_KEY"])

document_urls = [
    "https://example.com/report-q3.pdf",
    "https://example.com/handbook.docx",
    "https://example.com/pricing.xlsx",
]

markdown_documents = [
    client.convert_to_markdown(
        file={
            "type": "url",
            "name": url.split("/")[-1],
            "url": url,
        }
    )
    for url in document_urls
]
```

```go
package main

import (
	"os"
	"path"

	il "github.com/iterationlayer/sdk-go"
)

func main() {
	client := il.NewClient(os.Getenv("ITERATION_LAYER_API_KEY"))

	documentURLs := []string{
		"https://example.com/report-q3.pdf",
		"https://example.com/handbook.docx",
		"https://example.com/pricing.xlsx",
	}

	var markdownDocuments []*il.MarkdownFileResult
	for _, documentURL := range documentURLs {
		result, err := client.ConvertToMarkdown(il.ConvertRequest{
			File: il.NewFileFromURL(
				path.Base(documentURL),
				documentURL,
			),
		})
		if err != nil {
			panic(err)
		}
		markdownDocuments = append(markdownDocuments, result)
	}
}
```

At this point, every document — PDF, DOCX, XLSX, scanned image — is clean markdown. The rest of the pipeline does not need to know what the original format was.
### Step 2: Chunk the Markdown
Split each markdown document into chunks using heading-based boundaries. Keep tables intact. Add metadata.
The exact chunking logic depends on your embedding model’s context window and your retrieval requirements. A common target is 500-1,000 tokens per chunk — large enough to be semantically meaningful, small enough to be specific.
For each chunk, generate a metadata object:
```json
{
  "source_document": "report-q3.pdf",
  "section_path": "Revenue > Regional Breakdown",
  "chunk_index": 4,
  "document_type": "financial_report",
  "has_table": true,
  "ingested_at": "2026-04-14T10:30:00Z"
}
```
The `has_table` flag is useful for retrieval routing. When a user asks a question that implies a numerical answer, you can boost chunks that contain tables — they are more likely to hold the specific data point.
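One way to sketch that boost in plain Python — the keyword list, the 1.2 multiplier, and the function name are all illustrative choices, not part of any API:

```python
# Boost table-bearing chunks when the query implies a numeric answer.
NUMERIC_HINTS = ("how much", "how many", "total", "revenue", "percent", "average")

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    numeric_query = any(hint in query.lower() for hint in NUMERIC_HINTS)
    for candidate in candidates:
        if numeric_query and candidate["metadata"].get("has_table"):
            candidate["score"] *= 1.2  # small boost; tune empirically
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

candidates = [
    {"id": "prose", "score": 0.80, "metadata": {"has_table": False}},
    {"id": "table", "score": 0.78, "metadata": {"has_table": True}},
]
# The numeric query promotes the table chunk past the prose chunk.
ranked = rerank("How much revenue came from SMB?", candidates)
```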
### Step 3: Embed and Index
Pass each chunk through your embedding model and store the resulting vector alongside the chunk text and metadata in your vector database.
Before embedding, prepend the section path to the chunk text. This gives the embedding model additional context about where the chunk fits in the document hierarchy. A chunk that embeds as “Q3 Financial Report > Revenue > Regional Breakdown\n\n| Region | Revenue | …” produces a more specific vector than the table alone.
Choose your embedding model based on the content. For documents that are mostly prose, general-purpose models like OpenAI’s text-embedding-3-large or Cohere’s embed-v4 work well. For documents heavy on domain-specific terminology — legal contracts, medical records, engineering specifications — consider whether a domain-adapted model improves retrieval precision for your use case.
Store the embedding, the chunk text, and the metadata object in your vector database. Most vector databases (Pinecone, Weaviate, Qdrant, pgvector) support metadata filtering, which lets you narrow the search space before running the vector similarity query.
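The order of operations — filter by metadata first, then rank by similarity — can be illustrated over an in-memory list. This is only a sketch of the concept; real vector databases apply the filter and run the similarity search server-side:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(index: list[dict], query_vector: list[float],
           filters: dict, top_k: int = 3) -> list[dict]:
    # Narrow the candidate set with metadata first, then rank by similarity.
    candidates = [
        item for item in index
        if all(item["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda item: cosine(item["vector"], query_vector), reverse=True)
    return candidates[:top_k]

index = [
    {"id": 1, "vector": [1.0, 0.0], "metadata": {"document_type": "financial_report"}},
    {"id": 2, "vector": [0.9, 0.1], "metadata": {"document_type": "handbook"}},
    {"id": 3, "vector": [0.2, 1.0], "metadata": {"document_type": "financial_report"}},
]
# The handbook chunk is excluded before similarity is ever computed.
results = search(index, [1.0, 0.0], {"document_type": "financial_report"})
```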
### Step 4: Chain with Document Extraction for Structured Fields
Some documents need both full-text ingestion for RAG and structured field extraction for database storage. An invoice needs its full content indexed for search, but it also needs the invoice number, total, and line items extracted as typed fields.
The Document to Markdown API and the Document Extraction API share the same ingestion pipeline. You can call both on the same document — markdown for RAG, structured fields for your database — using the same credit pool and the same API credentials.
```bash
# Get markdown for RAG
MARKDOWN=$(curl -s -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer $ITERATION_LAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "url",
      "name": "invoice.pdf",
      "url": "https://example.com/invoice.pdf"
    }
  }' | jq -r '.data.markdown')

# Get structured fields for the database
FIELDS=$(curl -s -X POST https://api.iterationlayer.com/document-extraction/v1/extract \
  -H "Authorization: Bearer $ITERATION_LAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{
      "type": "url",
      "name": "invoice.pdf",
      "url": "https://example.com/invoice.pdf"
    }],
    "schema": {
      "fields": [
        {
          "type": "TEXT",
          "name": "invoice_number",
          "description": "The invoice number"
        },
        {
          "type": "CURRENCY_AMOUNT",
          "name": "total",
          "description": "The total amount due"
        }
      ]
    }
  }')
```

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({
  apiKey: process.env.ITERATION_LAYER_API_KEY,
});

const invoiceFile = {
  type: "url" as const,
  name: "invoice.pdf",
  url: "https://example.com/invoice.pdf",
};

const [markdownResult, extractionResult] = await Promise.all([
  client.convertToMarkdown({ file: invoiceFile }),
  client.extract({
    files: [invoiceFile],
    schema: {
      fields: [
        {
          type: "TEXT",
          name: "invoice_number",
          description: "The invoice number",
        },
        {
          type: "CURRENCY_AMOUNT",
          name: "total",
          description: "The total amount due",
        },
      ],
    },
  }),
]);

// markdownResult.markdown → RAG pipeline
// extractionResult.invoice_number.value → database
```

```python
import os

from iterationlayer import IterationLayer

client = IterationLayer(api_key=os.environ["ITERATION_LAYER_API_KEY"])

invoice_file = {
    "type": "url",
    "name": "invoice.pdf",
    "url": "https://example.com/invoice.pdf",
}

markdown_result = client.convert_to_markdown(file=invoice_file)

extraction_result = client.extract(
    files=[invoice_file],
    schema={
        "fields": [
            {
                "type": "TEXT",
                "name": "invoice_number",
                "description": "The invoice number",
            },
            {
                "type": "CURRENCY_AMOUNT",
                "name": "total",
                "description": "The total amount due",
            },
        ]
    },
)

# markdown_result["markdown"] → RAG pipeline
# extraction_result["invoice_number"]["value"] → database
```

```go
package main

import (
	"fmt"
	"os"

	il "github.com/iterationlayer/sdk-go"
)

func main() {
	client := il.NewClient(os.Getenv("ITERATION_LAYER_API_KEY"))

	invoiceFile := il.NewFileFromURL(
		"invoice.pdf",
		"https://example.com/invoice.pdf",
	)

	markdownResult, err := client.ConvertToMarkdown(il.ConvertRequest{
		File: invoiceFile,
	})
	if err != nil {
		panic(err)
	}

	extractionResult, err := client.Extract(il.ExtractRequest{
		Files: []il.FileInput{invoiceFile},
		Schema: il.ExtractionSchema{
			"invoice_number": il.NewTextFieldConfig("invoice_number", "The invoice number"),
			"total":          il.NewCurrencyAmountFieldConfig("total", "The total amount due"),
		},
	})
	if err != nil {
		panic(err)
	}

	fmt.Println(markdownResult.Markdown)
	fmt.Println(extractionResult)
}
```

One vendor, one credit pool, two complementary outputs from the same document.
## Supported Formats
One endpoint handles everything. No format-specific configuration, no separate tools for images versus documents.
- PDF — text-based and scanned, with built-in OCR for scanned pages
- DOCX — Word documents with structure preserved
- XLSX — Excel spreadsheets rendered as markdown tables
- CSV — tabular data converted to markdown tables
- TXT — plain text passed through with minimal formatting
- HTML — content extracted, markup converted to markdown syntax
- PNG, JPEG, GIF, WebP — OCR for text, plus the description field for visual content
Files can be submitted as a URL or as base64-encoded content. Base64 is useful when the file is not publicly accessible — read it from disk, encode it, and send it inline.
```bash
curl -X POST https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "base64",
      "name": "scan.png",
      "base64": "'$(base64 -w 0 scan.png)'"
    }
  }'
```

```typescript
import { IterationLayer } from "iterationlayer";
import { readFileSync } from "fs";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const fileBuffer = readFileSync("scan.png");

const result = await client.convertToMarkdown({
  file: {
    type: "base64",
    name: "scan.png",
    base64: fileBuffer.toString("base64"),
  },
});
```

```python
import base64

from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

with open("scan.png", "rb") as f:
    file_base64 = base64.b64encode(f.read()).decode()

result = client.convert_to_markdown(
    file={
        "type": "base64",
        "name": "scan.png",
        "base64": file_base64,
    }
)
```

```go
package main

import (
	"encoding/base64"
	"fmt"
	"os"

	il "github.com/iterationlayer/sdk-go"
)

func main() {
	client := il.NewClient("YOUR_API_KEY")

	fileBytes, err := os.ReadFile("scan.png")
	if err != nil {
		panic(err)
	}
	fileBase64 := base64.StdEncoding.EncodeToString(fileBytes)

	result, err := client.ConvertToMarkdown(il.ConvertRequest{
		File: il.FileInputBase64{
			Type:   "base64",
			Name:   "scan.png",
			Base64: fileBase64,
		},
	})
	if err != nil {
		panic(err)
	}

	fmt.Println(result.Markdown)
}
```

## When to Use Document to Markdown vs. Document Extraction
Both APIs share the same ingestion pipeline. Document to Markdown stops after converting the file to text. Document Extraction adds a second step: it uses an LLM to extract specific named fields according to a schema you define.
Use Document to Markdown when:
- You need the full document content as text
- The output feeds into a RAG pipeline, LLM prompt, or knowledge base
- You do not know in advance which fields matter
- You need a uniform text representation of mixed file types
Use Document Extraction when:
- You know exactly which fields you need (invoice number, total, line items)
- You need typed values (dates as ISO dates, currencies as structured objects)
- You need confidence scores and source citations per field
- The output feeds into a database or structured workflow
Use both when:
- You need full-text search and structured field access for the same documents
- Your RAG pipeline answers open-ended questions, but your application also needs specific typed values for display, filtering, or computation
- You are building a document management system where users search by content and filter by metadata (date, vendor, amount, status)
For a RAG pipeline specifically, Document to Markdown is almost always the right starting point. The full markdown gives you complete coverage — every sentence, every table, every heading is available for retrieval. Document Extraction is the complement, not the replacement. Use it to pull out the structured fields that your application layer needs for things vectors cannot do: sorting invoices by total, filtering contracts by effective date, grouping documents by vendor.
A common architecture: run Document to Markdown on every incoming document and push the chunks into your vector store. For document types with known schemas — invoices, contracts, purchase orders — also run Document Extraction and store the structured fields in a relational database. The vector store handles “find me documents about X.” The relational database handles “show me all invoices over $10,000 from the last quarter.” Both queries hit the same documents, processed once, stored twice in complementary systems.
## Common Pitfalls and How to Avoid Them
### Chunking Before Cleaning
Some pipelines chunk first and clean later. This is backwards. If your extraction produces garbled text, splitting it into smaller pieces of garbled text does not help. Clean extraction first, chunking second.
### Ignoring Table Context
A table without its preceding paragraph often lacks context. “Revenue by Segment” above a table tells you what the numbers mean. If you chunk the table separately from its heading and intro paragraph, the chunk is less useful for retrieval. Keep at least the immediately preceding heading and paragraph with any table chunk.
### Treating All Documents the Same
A 200-page legal contract and a one-page receipt need different chunking strategies. The contract benefits from heading-based chunking with large chunks. The receipt is one chunk. Build your pipeline to adapt chunk size based on document length and structure.
### Discarding Metadata
The document filename, creation date, and source system are retrieval signals. A query like “What was our Q3 revenue?” should prefer chunks from documents named “Q3 Report” over chunks from a general handbook that happens to mention revenue. Store and use metadata.
## Getting Started
The Document to Markdown API is available to all Iteration Layer accounts. Every file type, every format, one endpoint. See the documentation for the full request and response reference.
If you are building a RAG pipeline, start with the conversion step. Get your documents into clean markdown. Then layer on chunking, embedding, and retrieval. The extraction quality determines everything downstream — get that right, and the rest of the pipeline follows.
For workflows that need both full-text and structured extraction, combine Document to Markdown with the Document Extraction API. Same auth, same credits, same API conventions. Parse the document once, use the output twice.