Convert Any Document to Clean Markdown in n8n


The Markdown Conversion Problem in n8n

You have a pile of documents — PDFs, DOCX files, HTML pages, scanned images — and you need them in Markdown. Maybe you are building a RAG pipeline and need clean text chunks. Maybe you are migrating a knowledge base from Confluence to a static site. Maybe you are feeding documents into an LLM and want structured input instead of raw text dumps.

The current n8n options are not great. You can chain Mistral OCR to extract text and then pass that text through an LLM to convert it into Markdown. Two services, two billing meters, and results that vary with every run. Tables come out mangled — columns collapse into comma-separated lines. Headers lose their hierarchy. Code blocks disappear entirely or get wrapped in paragraph tags.

Niche services like MinerU or Datalab.to exist, but they require HTTP Request nodes with manual auth handling, binary-to-base64 conversion glue, and custom error handling in Function nodes. You end up maintaining more integration code than workflow logic.

One Node, Clean Markdown

Iteration Layer Document to Markdown converts PDF, DOCX, HTML, images, and more into structured Markdown. Tables stay as Markdown tables. Headers keep their levels. Code blocks keep their fences. The conversion is deterministic — same document in, same Markdown out.

No OCR preprocessing step. No LLM token costs. No temperature parameter quietly rewriting your headings. One API call per document, fixed credit cost regardless of page count or content density.

The output is clean enough to commit straight into a Git repository, feed into a vector database, or render in any Markdown-compatible tool.

The Workflow: Document Upload to Markdown Output

Here is what we are building in n8n: a webhook that receives document uploads, converts them to Markdown through the Iteration Layer node, and outputs clean structured text ready for downstream processing. Two nodes. No Function nodes. No LLM chain.

Step 1: Webhook Trigger

Open the n8n canvas and add a new node. Search for “Webhook” and add it.

In the node settings, set the HTTP Method to POST. Set Response Mode to “Last Node” so the workflow returns the Markdown output to the caller. Under Options, set Binary Data to true. This lets the webhook accept file uploads as n8n binary data.

Copy the webhook URL — this is the endpoint your application, CMS, or another workflow will send documents to. You can test it with a simple curl command:

curl -X POST \
  -F "file=@document.pdf" \
  https://your-n8n-instance.com/webhook/document-to-markdown

Step 2: Iteration Layer (Document to Markdown)

Add an Iteration Layer node after the webhook trigger. In the Resource dropdown, select Document to Markdown.

Under File Input Mode, select Binary Data so the node picks up the uploaded file from the webhook trigger. The node automatically detects the file type — PDF, DOCX, HTML, images — no format parameter needed.

In the File Name field, enter the expression {{ $binary.file.fileName }} to pass through the original filename from the upload. This helps the conversion engine handle format-specific nuances.

That is the entire configuration. No schema to define, no output format to specify. The node returns the document content as Markdown text in the response body.

A typical response for a PDF with mixed content — headings, body text, a table, and a code snippet — looks like this:

## Quarterly Sales Report

Revenue grew 12% quarter-over-quarter, driven primarily
by enterprise contract renewals.

### Regional Breakdown

| Region | Q1 Revenue | Q2 Revenue | Growth |
|--------|-----------|-----------|--------|
| EMEA | $2.4M | $2.7M | 12.5% |
| APAC | $1.8M | $2.1M | 16.7% |
| Americas | $4.1M | $4.5M | 9.8% |

### Technical Notes

The reporting pipeline uses the following query:

```sql
SELECT region, quarter, SUM(revenue)
FROM contracts
GROUP BY region, quarter
```

Headers, tables, and code blocks — all preserved with correct Markdown syntax. No collapsed columns, no lost hierarchy, no paragraph-wrapped code.

Use Case: RAG Ingestion Prep

The most common reason to convert documents to Markdown is preparing them for retrieval-augmented generation. Vector databases need clean text chunks, and Markdown provides natural splitting points — headings create semantic boundaries, lists group related items, and code blocks stay intact.

Wire the Markdown output from the Iteration Layer node into a text splitter (the LangChain Text Splitter node works well here), then into an embeddings node and a vector store write. The Markdown structure means your chunks align with the document’s semantic structure instead of splitting mid-sentence or mid-table.

Compare that to splitting raw OCR text, where the splitter has no structural cues and a table row can land in a different chunk than its header.
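To make the "headings create semantic boundaries" point concrete, here is a minimal heading-based splitter in plain Python. It is independent of n8n and of any particular splitter node, and it is deliberately naive: it does not skip `#` lines inside code fences, which a production splitter should handle.

```python
import re

def split_on_headings(markdown: str, max_level: int = 3) -> list[str]:
    """Split Markdown into chunks at heading boundaries (levels 1..max_level).

    Each chunk keeps a section's heading together with its body, tables,
    and code blocks, so downstream embeddings see coherent units.
    Naive sketch: does not exclude '#' lines inside fenced code blocks.
    """
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s", re.MULTILINE)
    positions = [m.start() for m in pattern.finditer(markdown)]
    if not positions:
        return [markdown.strip()] if markdown.strip() else []
    chunks = []
    # Keep any preamble before the first heading as its own chunk.
    if positions[0] > 0 and markdown[: positions[0]].strip():
        chunks.append(markdown[: positions[0]].strip())
    bounds = positions + [len(markdown)]
    for start, end in zip(bounds, bounds[1:]):
        chunks.append(markdown[start:end].strip())
    return chunks
```

Run against the sample report output above, this keeps the Regional Breakdown table in the same chunk as its heading, which is exactly the structural cue raw OCR text lacks.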

Use Case: Knowledge Base Migration

Moving documentation from one platform to another — Confluence to GitBook, SharePoint to Notion, legacy CMS to a static site — means converting hundreds of documents to Markdown. Export the documents as PDF or HTML, feed them through this workflow, and get Markdown files ready for import.

Add a Loop Over Items node before the Iteration Layer node to process batches. Add a Write Binary File node or an S3 upload after it to store the results. The entire migration pipeline runs unattended.
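If you prefer to drive the batch from outside n8n, the same idea can be sketched as a small Python script that posts each exported file to the webhook. The URL is the placeholder from the curl example, and the assumption that the workflow returns the Markdown as the raw response body follows from the "Last Node" response mode; verify both against your own setup.

```python
from pathlib import Path

WEBHOOK_URL = "https://your-n8n-instance.com/webhook/document-to-markdown"  # placeholder

def markdown_path(src: Path, out_dir: Path) -> Path:
    """Map an exported document to its Markdown destination, e.g. setup.pdf -> out/setup.md."""
    return out_dir / src.with_suffix(".md").name

def convert_file(src: Path, out_dir: Path) -> Path:
    """POST one document to the webhook and save the returned Markdown."""
    import requests  # third-party dependency, imported lazily: pip install requests

    with src.open("rb") as f:
        resp = requests.post(WEBHOOK_URL, files={"file": (src.name, f)})
    resp.raise_for_status()
    dest = markdown_path(src, out_dir)
    dest.write_text(resp.text, encoding="utf-8")
    return dest

if __name__ == "__main__":
    out = Path("markdown-out")
    out.mkdir(exist_ok=True)
    for doc in sorted(Path("exports").glob("*.pdf")):
        print(f"{doc} -> {convert_file(doc, out)}")
```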

Use Case: Content Migration

CMS migrations often involve converting rich HTML content to Markdown for a new headless CMS or a Markdown-based static site generator. The Document to Markdown API handles HTML input directly — pass in the HTML file and get back clean Markdown without the <div> soup, inline styles, and class attributes that HTML-to-Markdown regex scripts leave behind.

Chaining: Markdown to Structured Data

The Markdown output from Document to Markdown chains directly into Iteration Layer Document Extraction. This is useful when the source document is too complex for direct extraction — dense multi-column layouts, documents mixing narrative text with structured tables, or scanned images where direct field extraction struggles.

Convert the document to Markdown first, then extract structured data from the Markdown. The extraction engine works better on clean Markdown than on raw document bytes for these edge cases because the structure is already resolved.

In n8n, this means two Iteration Layer nodes in sequence:

  • Node 1 (Document to Markdown): Convert the source document to Markdown
  • Node 2 (Document Extraction): Extract specific fields from the Markdown output using a schema

The Markdown acts as a clean intermediate representation. Define your extraction schema on Node 2 the same way you would for any other document — field names, types, descriptions, required flags — and the extraction runs against the Markdown text.
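As an illustration, a schema for the sample quarterly report above might look like the following. The field names and this exact dict shape are hypothetical, chosen only to show the field-name/type/description/required pattern; check the Document Extraction docs for the real parameter format.

```python
# Hypothetical extraction schema for Node 2 (Document Extraction),
# targeting the quarterly report converted by Node 1.
extraction_schema = {
    "fields": [
        {"name": "report_title", "type": "string", "required": True,
         "description": "Top-level heading of the report"},
        {"name": "qoq_growth_pct", "type": "number", "required": True,
         "description": "Overall quarter-over-quarter revenue growth, in percent"},
        {"name": "regions", "type": "array", "required": False,
         "description": "One entry per row of the Regional Breakdown table"},
    ]
}
```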

Get Started

Install the Iteration Layer community node from the n8n UI — search for n8n-nodes-iterationlayer under Settings > Community Nodes. The Document to Markdown docs cover all supported input formats, output options, and edge case handling. The n8n integration docs walk through every resource and parameter.

Try it with one document. Upload a PDF with tables, headers, and code blocks. Compare the Markdown output to what you get from an OCR + LLM chain. The difference in structure preservation — especially for tables — is immediately visible. Sign up to get your API key.

Build your first workflow in minutes

Chain our APIs together and ship a complete pipeline before lunch. Free trial credits included — no credit card required.