RAG from Public Documentation Websites: Robots.txt, Terms, Retention, and Attribution

Public Docs Are the Easiest RAG Source to Get Wrong

Every AI support project eventually reaches for public documentation. The pages are already written. They are structured. They explain the product better than any internal wiki. A crawler can fetch them in minutes.

Then the problems start.

The docs site changes its navigation and half your URLs 404. Versioned pages duplicate the same paragraph across /v2/, /v3/, and /latest/. Code blocks lose their indentation and language annotations when your HTML parser flattens everything to text. Tables — the part developers actually reference — turn into a wall of unseparated words. The retrieval system answers confidently from a page that was deleted three months ago. And because there are no citations, the user cannot tell whether the model is quoting official docs or hallucinating something plausible.

Public documentation looks like low-friction input. In practice, it is harder to get right than your own internal content — because you do not control the source, the structure, or the lifecycle.

Three categories of problems make public docs uniquely tricky for RAG: legal constraints that do not apply to your own content, structural fidelity that most ingestion pipelines destroy, and lifecycle management for content you did not author and cannot predict. Each one can quietly degrade your system. Together, they are why the simplest-looking RAG source often produces the least reliable answers.

You Do Not Control This Content

When you build RAG over your own knowledge base, you control the content. You know when it changes, why it changes, and whether it should be in the index. Public docs are someone else’s content, served on someone else’s infrastructure, under someone else’s rules.

That changes the ingestion calculus in ways developers tend to skip over.

Robots.txt Is a Signal, Not a Contract

Start with robots.txt, but do not stop there.

robots.txt tells automated agents which paths the site owner asks crawlers not to access. It is not a complete legal framework, and it is not a universal answer. But ignoring it is a bad default. If the docs site disallows a path for automated access, your ingestion job should treat that as a stop sign unless you have a separate permission basis.

The nuances matter for documentation sites specifically:

A robots.txt check is five minutes of work and saves you from the most obvious compliance failures. But it is only the first check, not the last.
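
As a minimal sketch of that first check, Python's standard-library `urllib.robotparser` can evaluate a robots.txt file that you have already fetched. The user agent string and the sample rules below are assumptions for illustration, not values from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent for the ingestion job; use the one your crawler really sends.
USER_AGENT = "example-docs-ingest"

# Sample robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True when the rules do not disallow this URL for the user agent."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = is_allowed(parser, "https://docs.example.com/api/authentication")
blocked = is_allowed(parser, "https://docs.example.com/internal/runbook")
```

A disallowed URL should drop out of the crawl queue before any fetch happens. Passing this check says nothing about the site's terms, which still need their own review.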

Terms Restrict More Than You Expect

Documentation sites often have terms of service that restrict automated access, redistribution, commercial use, or derivative datasets. Developers tend to skip these because the pages are public and technical. That shortcut is risky when the RAG system is part of a commercial product.

What to look for:

If the documentation belongs to your own product, these concerns are simple. If it belongs to a vendor, an open-source project, or a competitor, the review matters more.

The safest ingestion pattern: have explicit permission, use the docs for a legitimate purpose, retain only what you need, and link users back to the source.

Your Own Docs Are the Easy Case

For most teams building RAG, the first and best target is your own documentation. You control the content, you control the terms, you know when pages change, and you have every right to process them however you want.

The irony is that many teams skip their own docs (“we already have them in the CMS”) and jump straight to ingesting third-party documentation where every legal and lifecycle question is harder. Start with your own content. Build the pipeline right. Then extend it to external sources where the permission basis is clear.

HTML to Text Is Where Quality Dies

The technical problem with public docs is not fetching them. Any HTTP client can fetch an HTML page. The problem is what happens next.

Most ingestion pipelines strip HTML tags and dump the text content. That destroys the structure that makes documentation useful for retrieval.

Consider a typical API reference page. Rendered in the browser, it has the structure shown here as markdown:

## Create an API Key

Send a POST request to create a new API key for your project.

```bash
curl -X POST https://api.example.com/keys \
  -H "Authorization: Bearer TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "production", "scope": "project-123"}'
```

### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name  | string | yes    | Human-readable key name |
| scope | string | no     | Optional project scope |

> Warning: API keys are shown only once. Store them securely.

After a typical tag-stripping extraction that collapses whitespace, the same content becomes:

Create an API Key Send a POST request to create a new API key for your project. curl -X POST https://api.example.com/keys -H Authorization: Bearer TOKEN -H Content-Type: application/json -d {"name": "production", "scope": "project-123"} Parameters Field Type Required Description name string yes Human-readable key name scope string no Optional project scope Warning: API keys are shown only once. Store them securely.

The second version still contains the words. It no longer contains the documentation.

The heading that marks a section boundary is gone — there is no way to chunk by topic. The code block lost its formatting, its language annotation, and the line continuations that make the curl command readable. The parameter table is now a run-on sentence where “name string yes” could mean anything. The warning callout, which is visually distinct in HTML, is indistinguishable from body text.

This matters for RAG in two concrete ways. First, embedding models trained on well-structured text tend to represent flattened strings poorly when code, prose, and table data are concatenated without boundaries. Second, when the LLM generates an answer from a retrieved chunk, it cannot reproduce a table it never saw as a table, or format a code example it received as a single line.

Markdown Preserves What Matters

The fix is to normalize HTML into markdown before chunking. Markdown keeps:

The conversion is not trivial. Real documentation HTML includes navigation sidebars, footer links, cookie banners, breadcrumb trails, search widgets, and JavaScript-rendered content that is invisible in the raw HTML. A good HTML-to-markdown pipeline strips all of that and keeps only the content.
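
To make the stripping step concrete, here is a minimal sketch built only on Python's standard-library `html.parser`. It is not a full converter: it keeps headings, paragraphs, and `pre` blocks, skips a short assumed list of page-chrome tags, and drops tables, lists, and inline formatting. The `language-` class prefix is a common highlighter convention that this sketch assumes, and the sample HTML is invented.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this sketch; real sites need a longer list.
SKIPPED_TAGS = {"nav", "footer", "aside", "script", "style"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
FENCE = "`" * 3  # markdown code fence


class DocsToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []         # finished markdown blocks
        self.buffer = []         # text of the block being collected
        self.current = None      # tag of the block being collected
        self.code_language = ""  # language hint found on <code> inside <pre>
        self.skip_depth = 0      # greater than zero while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADING_PREFIX or tag in ("p", "pre"):
            self.current, self.buffer, self.code_language = tag, [], ""
        elif tag == "code" and self.current == "pre":
            for css_class in (dict(attrs).get("class") or "").split():
                if css_class.startswith("language-"):
                    self.code_language = css_class[len("language-"):]

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth or tag != self.current:
            return
        else:
            text = "".join(self.buffer)
            if tag == "pre":
                # Keep line breaks and indentation exactly as written.
                code = text.strip("\n")
                self.blocks.append(FENCE + self.code_language + "\n" + code + "\n" + FENCE)
            else:
                # Collapse whitespace in prose and headings only.
                self.blocks.append(HEADING_PREFIX.get(tag, "") + " ".join(text.split()))
            self.current = None

    def handle_data(self, data):
        if not self.skip_depth and self.current:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    converter = DocsToMarkdown()
    converter.feed(html)
    converter.close()
    return "\n\n".join(converter.blocks)


SAMPLE_HTML = """
<nav><p>Home / Docs</p></nav>
<h2>Create an API Key</h2>
<p>Send a <code>POST</code> request.</p>
<pre><code class="language-bash">curl -X POST https://api.example.com/keys \\
  -H "Authorization: Bearer TOKEN"</code></pre>
<footer><p>Cookie settings</p></footer>
"""

markdown = html_to_markdown(SAMPLE_HTML)
```

The point of the sketch is the asymmetry: prose can have its whitespace collapsed, while code must pass through untouched with its language hint attached.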

Here is what the conversion looks like using the Document to Markdown API, which handles HTML cleanup, table preservation, and code block detection:

curl:

```bash
curl -X POST \
  https://api.iterationlayer.com/document-to-markdown/v1/convert \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "url",
      "url": "https://docs.example.com/api/authentication"
    }
  }'
```

TypeScript:

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const result = await client.convertDocumentToMarkdown({
  file: {
    type: "url",
    url: "https://docs.example.com/api/authentication",
  },
});

// result.markdown contains clean markdown with headings, code blocks, and tables preserved
```

Python:

```python
from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

result = client.convert_document_to_markdown(
    file={
        "type": "url",
        "url": "https://docs.example.com/api/authentication",
    }
)

# result["markdown"] contains clean markdown with headings, code blocks, and tables preserved
```

Go:

```go
import il "github.com/iterationlayer/sdk-go"

client := il.NewClient("YOUR_API_KEY")

result, err := client.ConvertDocumentToMarkdown(il.ConvertDocumentToMarkdownRequest{
    File: il.NewWebsiteFromURL("https://docs.example.com/api/authentication"),
})

// result.Markdown contains clean markdown with headings, code blocks, and tables preserved
```

For JavaScript-rendered documentation sites (React, Next.js, single-page apps), add fetch_options to request browser-based retrieval:

```json
{
  "file": {
    "type": "url",
    "url": "https://docs.example.com/api/authentication",
    "fetch_options": {
      "should_render_javascript": true
    }
  }
}
```

The output is clean markdown — ready for chunking, embedding, and retrieval.

Chunk by Documentation Structure, Not Token Windows

Once you have clean markdown, the next question is how to split it into chunks for embedding. The generic approach — fixed-size token windows with overlap — works for undifferentiated text. Documentation is not undifferentiated text.

Documentation has natural boundaries that token-window chunking ignores:

Split a table from its header, and the chunk is ambiguous. Split a code example from the paragraph that sets it up, and the chunk loses intent. Merge two unrelated sections into one chunk because they happen to fit in the token window, and the embedding represents neither topic well.

Use Heading Hierarchy as Your Chunk Map

Markdown headings give you a free topic tree. Use it.

For API reference pages, keep endpoint descriptions, parameter tables, and code examples together as one chunk. A chunk that contains only a parameter table without the endpoint path is not useful — the embedding model does not know what API those parameters belong to.

For guide pages, split at ## and ### headings, but carry the parent heading chain into chunk metadata. A chunk titled “Authentication” is ambiguous across products. A chunk titled “Payments API > Authentication > OAuth2 Flow” is specific enough that retrieval can rank it correctly.

For long sections, treat code blocks and tables as atomic units. If a section is too long for your embedding model’s context window, split at the nearest heading or paragraph boundary — never in the middle of a table or code block.
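
The heading-based split can be sketched in a few lines of Python. This is one possible implementation under stated assumptions: it splits at every heading level, tracks the parent heading chain as a stack, and ignores `#` lines inside fenced code blocks so that shell comments are not mistaken for headings. The `Chunk` type and the sample page are invented for the example.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # markdown code fence


@dataclass
class Chunk:
    heading_path: list  # page title followed by the parent heading chain
    text: str           # section body, code blocks kept whole


def chunk_by_headings(markdown: str, page_title: str) -> list:
    chunks = []
    stack = []   # (level, title) pairs for the current heading chain
    lines = []   # body lines of the section being collected
    in_fence = False

    def flush():
        body = "\n".join(lines).strip()
        if body:
            path = [page_title] + [title for _, title in stack]
            chunks.append(Chunk(path, body))
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            lines.append(line)
    flush()
    return chunks


SAMPLE = "\n".join([
    "# Authentication",
    "Intro paragraph.",
    "## API Keys",
    "Create a key:",
    FENCE + "bash",
    "# this comment is not a heading",
    "curl -X POST https://api.example.com/keys",
    FENCE,
    "## OAuth2 Flow",
    "Redirect the user.",
])

chunks = chunk_by_headings(SAMPLE, page_title="Payments API")
titles = [" > ".join(chunk.heading_path) for chunk in chunks]
```

Joining `heading_path` with ` > ` produces titles such as "Payments API > Authentication > OAuth2 Flow". A section that exceeds the embedding model's limit would still need a second pass that splits at paragraph boundaries, which this sketch leaves out.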

Carry Context Into Every Chunk

A chunk without context is a chunk that retrieves for the wrong queries. Every chunk in your index should carry:

This metadata is not optional overhead. It is what makes retrieval reliable and citations trustworthy.

Extract Page Metadata as Typed Fields

Some of that metadata — page title, product area, last-updated date — is visible on the page but not in the markdown body. You could try to infer it from the heading or parse it from HTML meta tags. Or you can extract it as typed, validated fields.

This is where Website Extraction fits into the pipeline. While Document to Markdown converts the page body for embedding, Website Extraction pulls structured metadata from the same page:

Request (curl):

```bash
curl -X POST \
  https://api.iterationlayer.com/website-extraction/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "type": "url",
      "url": "https://docs.example.com/api/authentication"
    },
    "schema": {
      "fields": [
        {
          "name": "page_title",
          "type": "TEXT",
          "description": "The title of the documentation page"
        },
        {
          "name": "product_area",
          "type": "TEXT",
          "description": "The product area or API category this page covers"
        },
        {
          "name": "last_updated",
          "type": "DATE",
          "description": "The visible last-updated date, if shown on the page"
        },
        {
          "name": "api_version",
          "type": "TEXT",
          "description": "The API version this page documents, if specified"
        },
        {
          "name": "prerequisites",
          "type": "ARRAY",
          "description": "Prerequisites or requirements listed on the page",
          "fields": [
            {
              "name": "prerequisite",
              "type": "TEXT",
              "description": "A single prerequisite or requirement"
            }
          ]
        }
      ]
    }
  }'
```

Response:

```json
{
  "success": true,
  "data": {
    "page_title": {
      "type": "TEXT",
      "value": "Authentication",
      "confidence": 0.98,
      "citations": ["Authentication"],
      "source": "authentication.html"
    },
    "product_area": {
      "type": "TEXT",
      "value": "API Reference",
      "confidence": 0.94,
      "citations": ["API Reference > Authentication"],
      "source": "authentication.html"
    },
    "last_updated": {
      "type": "DATE",
      "value": "2026-03-15",
      "confidence": 0.91,
      "citations": ["Last updated: March 15, 2026"],
      "source": "authentication.html"
    },
    "api_version": {
      "type": "TEXT",
      "value": "v2",
      "confidence": 0.89,
      "citations": ["API v2"],
      "source": "authentication.html"
    },
    "prerequisites": {
      "type": "ARRAY",
      "value": [
        {
          "prerequisite": {
            "type": "TEXT",
            "value": "An active API key",
            "confidence": 0.95,
            "citations": ["You need an active API key"],
            "source": "authentication.html"
          }
        }
      ],
      "confidence": 0.95,
      "citations": ["Prerequisites: An active API key"],
      "source": "authentication.html"
    }
  },
  "metadata": {
    "url": "https://docs.example.com/api/authentication"
  }
}
```

Request (TypeScript):

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const metadata = await client.extractWebsite({
  file: {
    type: "url",
    url: "https://docs.example.com/api/authentication",
  },
  schema: {
    fields: [
      { name: "page_title", type: "TEXT", description: "The title of the documentation page" },
      { name: "product_area", type: "TEXT", description: "The product area or API category this page covers" },
      { name: "last_updated", type: "DATE", description: "The visible last-updated date, if shown on the page" },
      { name: "api_version", type: "TEXT", description: "The API version this page documents, if specified" },
      {
        name: "prerequisites",
        type: "ARRAY",
        description: "Prerequisites or requirements listed on the page",
        fields: [
          { name: "prerequisite", type: "TEXT", description: "A single prerequisite or requirement" },
        ],
      },
    ],
  },
});
```

Request (Python):

```python
from iterationlayer import IterationLayer

client = IterationLayer(api_key="YOUR_API_KEY")

metadata = client.extract_website(
    file={
        "type": "url",
        "url": "https://docs.example.com/api/authentication",
    },
    schema={
        "fields": [
            {"name": "page_title", "type": "TEXT", "description": "The title of the documentation page"},
            {"name": "product_area", "type": "TEXT", "description": "The product area or API category this page covers"},
            {"name": "last_updated", "type": "DATE", "description": "The visible last-updated date, if shown on the page"},
            {"name": "api_version", "type": "TEXT", "description": "The API version this page documents, if specified"},
            {
                "name": "prerequisites",
                "type": "ARRAY",
                "description": "Prerequisites or requirements listed on the page",
                "fields": [
                    {"name": "prerequisite", "type": "TEXT", "description": "A single prerequisite or requirement"},
                ],
            },
        ]
    },
)
```

Request (Go):

```go
import il "github.com/iterationlayer/sdk-go"

client := il.NewClient("YOUR_API_KEY")

metadata, err := client.ExtractWebsite(il.ExtractWebsiteRequest{
    File: il.NewWebsiteFromURL("https://docs.example.com/api/authentication"),
    Schema: il.ExtractionSchema{
        "page_title":   il.NewTextFieldConfig("page_title", "The title of the documentation page"),
        "product_area": il.NewTextFieldConfig("product_area", "The product area or API category this page covers"),
        "last_updated": il.NewDateFieldConfig("last_updated", "The visible last-updated date, if shown on the page"),
        "api_version":  il.NewTextFieldConfig("api_version", "The API version this page documents, if specified"),
        "prerequisites": il.NewArrayFieldConfig("prerequisites", "Prerequisites or requirements listed on the page", []il.FieldConfig{
            il.NewTextFieldConfig("prerequisite", "A single prerequisite or requirement"),
        }),
    },
})
```

This extracted metadata flows directly into your chunk records. Instead of guessing the product area from the URL path or running ad hoc parsing heuristics against a “Last updated” string in the HTML, you get typed, validated fields with confidence scores. The product_area field becomes a retrieval filter. The last_updated date becomes a freshness signal. The prerequisites array becomes context the LLM can use when generating answers.
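
One way to turn a response of this shape into flat chunk metadata is sketched below. The confidence threshold of 0.9 is an assumed policy rather than a recommended value, and the function assumes each array item wraps a single sub-field, as `prerequisites` does in the example schema. The sample data is trimmed from the response shown earlier.

```python
from typing import Any

# Hypothetical cut-off: fields extracted with lower confidence are left out.
MIN_CONFIDENCE = 0.9


def flatten_fields(data: dict, min_confidence: float = MIN_CONFIDENCE) -> dict:
    """Reduce the per-field response objects to plain values for chunk metadata."""
    metadata: dict = {}
    for name, field in data.items():
        if field["confidence"] < min_confidence:
            continue
        if field["type"] == "ARRAY":
            # Each array item is a dict of sub-fields; keep only their values.
            metadata[name] = [
                sub_field["value"]
                for item in field["value"]
                for sub_field in item.values()
            ]
        else:
            metadata[name] = field["value"]
    return metadata


SAMPLE_DATA: dict = {
    "page_title": {"type": "TEXT", "value": "Authentication", "confidence": 0.98},
    "last_updated": {"type": "DATE", "value": "2026-03-15", "confidence": 0.91},
    "api_version": {"type": "TEXT", "value": "v2", "confidence": 0.89},
    "prerequisites": {
        "type": "ARRAY",
        "confidence": 0.95,
        "value": [
            {"prerequisite": {"type": "TEXT", "value": "An active API key", "confidence": 0.95}}
        ],
    },
}

chunk_metadata = flatten_fields(SAMPLE_DATA)
```

With this threshold the `api_version` field (confidence 0.89) is dropped, so a chunk carries no version at all instead of a doubtful one. Whether that trade-off is right depends on how the field is used in retrieval filters.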

Two APIs, One Pipeline

The ingestion workflow for a single documentation page becomes:

  1. Convert the page to markdown with Document to Markdown — clean body text for chunking and embedding
  2. Extract metadata with Website Extraction — typed fields for indexing, filtering, and citations
  3. Chunk the markdown by heading hierarchy, attaching extracted metadata to every chunk
  4. Embed and store in your vector database

Both API calls use the same auth token and the same credit pool. The first gives you content for retrieval. The second gives you structure for filtering and display. Together, they turn an HTML page into index-ready chunks with reliable metadata — without writing a custom parser for every documentation site’s layout.

Stale Chunks Are Worse Than Missing Chunks

The lifecycle problem with public docs is unique: you do not know when the source changes, and you have no webhook to tell you.

When a page is updated, your index still contains the old version. When a page is deleted, your index still serves answers from it. When a product deprecates a feature, your RAG system confidently recommends the deprecated approach — because the old chunk is still in the vector database with a high similarity score.

This is worse than having no answer at all. A missing answer prompts the user to search elsewhere. A confident wrong answer from stale docs erodes trust in the entire system.

Build a Refresh Pipeline, Not a One-Time Crawl

The initial crawl is the easy part. The hard part is keeping the index aligned with the source over weeks and months.
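
A hash-based diff is one way to keep that alignment cheap. The sketch below assumes the index stores one content hash per source URL and that the fetcher reports a removed page (for example a 404 or 410) as `None`. Both are assumptions about your own pipeline, and the function names are made up for this example.

```python
import hashlib
from typing import Dict, List, Optional


def content_hash(markdown: str) -> str:
    """Hash the converted markdown, ignoring trailing whitespace differences."""
    normalized = "\n".join(line.rstrip() for line in markdown.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_refresh(
    indexed_hashes: Dict[str, str],
    fetched_pages: Dict[str, Optional[str]],
) -> Dict[str, List[str]]:
    """Sort URLs into the action each one needs after a crawl."""
    plan: Dict[str, List[str]] = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for url, markdown in fetched_pages.items():
        if markdown is None:
            # Source page is gone: its chunks should leave the index.
            if url in indexed_hashes:
                plan["deleted"].append(url)
        elif url not in indexed_hashes:
            plan["new"].append(url)
        elif indexed_hashes[url] != content_hash(markdown):
            plan["changed"].append(url)
        else:
            plan["unchanged"].append(url)
    # Indexed URLs the crawl never revisited; worth investigating, not trusting.
    plan["unverified"] = sorted(set(indexed_hashes) - set(fetched_pages))
    return plan


indexed = {
    "https://docs.example.com/a": content_hash("# A\ntext"),
    "https://docs.example.com/b": content_hash("# B\nold"),
    "https://docs.example.com/c": content_hash("# C"),
}
fetched = {
    "https://docs.example.com/a": "# A\ntext  \n",
    "https://docs.example.com/b": "# B\nnew",
    "https://docs.example.com/c": None,
    "https://docs.example.com/d": "# D",
}

plan = plan_refresh(indexed, fetched)
```

Only the `new` and `changed` URLs need re-chunking and re-embedding, and the `deleted` URLs drive chunk removal. Hashing the converted markdown instead of the raw HTML avoids re-embedding a page whenever its navigation or footer changes.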

A reliable refresh pipeline needs:

Retention Is Not Just a Compliance Question

Beyond freshness, think about what you keep and for how long.

RAG systems become accidental archives. A page is fetched once, embedded, and then lives in a vector database indefinitely — long after the source page changed, moved, or was removed on purpose. The source owner may have removed content for a reason: a security issue, a legal retraction, a product pivot. Your system should not keep serving that content.

Public docs can also include personal data that triggers retention obligations: author names, contributor emails, support contact paths, example data with real-looking identifiers. Even though the pages are public, storing that data indefinitely in your own systems has compliance implications — especially under GDPR, where the original publication does not automatically give you a separate lawful basis for indefinite storage in a different context.
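
Both concerns can be expressed as small, testable rules. The sketch below is illustrative only: the 30-day window, the chunk record keys (`chunk_id`, `last_verified_at`, `source_status`), and the email pattern are assumptions, and a regex will not catch every form of personal data.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical policy: purge chunks whose source was not re-verified in 30 days.
MAX_UNVERIFIED_AGE = timedelta(days=30)

# Simple email pattern for illustration; it is not a complete validator.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace email addresses before the text is stored or embedded."""
    return EMAIL.sub("[email removed]", text)


def chunks_to_purge(chunks: list, now: datetime, max_age: timedelta = MAX_UNVERIFIED_AGE) -> list:
    """Return ids of chunks whose source is gone or has not been verified recently."""
    expired = []
    for chunk in chunks:
        source_gone = chunk["source_status"] == "gone"
        too_old = now - chunk["last_verified_at"] > max_age
        if source_gone or too_old:
            expired.append(chunk["chunk_id"])
    return expired


NOW = datetime(2026, 4, 20, tzinfo=timezone.utc)
SAMPLE_CHUNKS = [
    {"chunk_id": "c1", "source_status": "ok",
     "last_verified_at": datetime(2026, 4, 15, tzinfo=timezone.utc)},
    {"chunk_id": "c2", "source_status": "ok",
     "last_verified_at": datetime(2026, 3, 1, tzinfo=timezone.utc)},
    {"chunk_id": "c3", "source_status": "gone",
     "last_verified_at": datetime(2026, 4, 19, tzinfo=timezone.utc)},
]

purge_ids = chunks_to_purge(SAMPLE_CHUNKS, NOW)
redacted = redact_emails("Contact jane.doe@example.com for help")
```

Running the purge on the same schedule as the refresh job keeps the vector database from outliving the pages it was built from.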

Set clear policies:

Attribution Is Not Optional

A RAG answer without a citation is an assertion without evidence. For public documentation, where the whole point is that the source is authoritative, stripping attribution defeats the purpose.

When a developer or support agent asks a question and gets an answer derived from public docs, they need to know:

This is not just a UX nicety. It is what separates a useful tool from a liability. An answer that cites “Stripe API Reference > Authentication > API Keys, retrieved April 20, 2026” is verifiable. An answer that says “you should use API keys” with no source is indistinguishable from a hallucination.

Good attribution requires ingestion-time metadata. If you throw away the source URL and heading path during conversion, you cannot reconstruct reliable citations later. This is why the metadata extraction step matters — it captures the citation anchors before you lose them.

Design the response format around citations from the beginning. Adding them later means re-processing your entire index.
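
A citation-first response shape might look like the sketch below. The chunk keys (`source_title`, `heading_path`, `url`, `retrieved_on`) are hypothetical names for the metadata captured at ingestion time, and the date is rendered in ISO format to keep the output independent of locale.

```python
from datetime import date


def format_citation(source_title: str, heading_path: list, url: str, retrieved_on: date) -> str:
    """Build a human-readable citation from ingestion-time metadata."""
    trail = " > ".join([source_title, *heading_path])
    return f"{trail}, retrieved {retrieved_on.isoformat()} ({url})"


def build_response(answer: str, chunks: list) -> dict:
    """Attach de-duplicated citations; refuse to answer without any source."""
    citations = []
    for chunk in chunks:
        citation = format_citation(
            chunk["source_title"], chunk["heading_path"], chunk["url"], chunk["retrieved_on"]
        )
        if citation not in citations:
            citations.append(citation)
    if not citations:
        raise ValueError("refusing to return an answer without a source")
    return {"answer": answer, "citations": citations}


SAMPLE_CHUNK = {
    "source_title": "Example API Reference",
    "heading_path": ["Authentication", "API Keys"],
    "url": "https://docs.example.com/api/authentication",
    "retrieved_on": date(2026, 4, 20),
}

# Two chunks from the same section collapse into a single citation.
response = build_response("Create an API key first.", [SAMPLE_CHUNK, dict(SAMPLE_CHUNK)])
```

Raising when no source is available is a deliberate design choice in this sketch: it makes an uncited answer a visible failure instead of a silent default.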

Build the Pipeline Around Source Truth

Public documentation is valuable because it is authoritative. Your RAG system should preserve that authority instead of turning docs into anonymous text chunks.

The pipeline that works:

  1. Check robots.txt and terms before you ingest anything
  2. Convert pages to clean markdown that preserves headings, code, and tables
  3. Extract typed metadata for indexing, filtering, and citations
  4. Chunk by documentation structure, not token windows
  5. Refresh on a schedule, hash-diff to minimize re-embedding costs
  6. Delete chunks when source pages disappear
  7. Cite the source URL and section in every answer

For implementation, read the Document to Markdown docs for page-to-markdown conversion and the Website Extraction docs for schema-based metadata extraction from public pages. Both APIs accept website URLs and handle HTML cleanup, JavaScript-rendered pages, and structured output — same auth, same credit pool.

Written by Fabian Schucht