Turn Messy Supplier Catalogs into Clean Product Data — Automatically - Blog

The Supplier Catalog Problem

You sell products online. Your suppliers send catalogs — some as PDFs, some as spreadsheets, some as scanned paper documents that someone photographed with their phone. Each supplier uses their own format, their own column names, their own way of listing specs.

You need to get this data into your system: product names, SKUs, prices, descriptions, categories, specifications. For one supplier, you could build a custom import script. For fifty suppliers, you need something that handles the variation.

The Document Extraction API uses schema-based extraction. Define the product fields you want, send any catalog format, and get structured JSON back. The same schema works across suppliers regardless of how they format their catalogs.

A Product Catalog Schema

import { IterationLayer } from "iterationlayer";
const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const { data } = await client.extract({
  files: [
    { type: "url", name: "catalog.pdf", url: "https://supplier.example.com/catalog-2026.pdf" }
  ],
  schema: {
    fields: [
      {
        name: "products",
        type: "ARRAY",
        description: "List of products in the catalog",
        item_schema: {
          fields: [
            { name: "product_name", type: "TEXT", description: "Product name or title" },
            { name: "sku", type: "TEXT", description: "SKU or product code" },
            { name: "description", type: "TEXTAREA", description: "Product description" },
            { name: "unit_price", type: "CURRENCY_AMOUNT", description: "Price per unit" },
            { name: "category", type: "TEXT", description: "Product category or type" },
            { name: "minimum_order_quantity", type: "INTEGER", description: "Minimum order quantity" },
          ],
        },
      },
      {
        name: "supplier_name",
        type: "TEXT",
        description: "Name of the supplier or manufacturer",
      },
      {
        name: "currency",
        type: "CURRENCY_CODE",
        description: "Currency used in the catalog",
      },
      {
        name: "catalog_date",
        type: "DATE",
        description: "Date or version of the catalog",
      },
    ],
  },
});

The ARRAY field handles the product list — whether the catalog has 5 products or 500. Each product row gets its own nested extraction with all the fields you defined.

Handling Different Formats

Suppliers don’t standardize. One sends a polished PDF with tables. Another sends a CSV export. A third sends a scan of a printed price list. The parser handles all of them:

PDF catalogs — reads tables, extracts structured data from formatted layouts
CSV files — understands column headers and maps them to your schema fields
Scanned documents — built-in OCR reads the text from images before extracting
DOCX files — parses Word tables and formatted text

You use the same schema for every supplier. The parser adapts to the document format.

The Column Name Problem

Every supplier names their columns differently. One calls it “Part Number”, another calls it “SKU”, a third calls it “Article No.”, and a fourth puts it in a column labeled “Ref.” with no further explanation. Your database calls it sku.

With a template-based approach, you’d map each supplier’s column names to your internal fields — a configuration file per supplier that breaks every time they redesign their catalog. With schema-based extraction, the field description does the mapping. “SKU or product code” is broad enough to match “Part Number”, “Article No.”, “Item Code”, and “Ref.” without a per-supplier configuration.

The same applies to prices. One catalog lists “Unit Price”, another lists “Price/EA”, a third has “EUR/piece”, and a fourth puts prices in a column called “Netto” (net price in German). The CURRENCY_AMOUNT type combined with the description “Price per unit” handles all of these and returns a normalized number.

Catalogs Without Tables

Not all catalogs are tabular. Some suppliers send product sheets — one page per product with a photo, a description paragraph, and specs scattered across the page in no particular grid structure. Others send catalogs laid out like magazine pages with products arranged in a visual grid.

The ARRAY field doesn’t require a literal HTML or PDF table. It extracts repeating structures. If a document has 20 product descriptions — each with a name, price, and description — the parser identifies the repeating pattern and extracts each instance as an array row. The document doesn’t need table markup or grid lines.

This matters for scanned catalogs especially. A scan of a printed catalog doesn’t have table structures in the PDF — it’s just an image with OCR text. The parser reconstructs the repeating product entries from the OCR output and maps them to your schema.

From Raw Catalog to Product Database

The structured output maps directly to your product database fields:

{
  "products": {
    "type": "ARRAY",
    "value": [
      [
        { "value": "Industrial LED Panel 60x60", "confidence": 0.95 },
        { "value": "LED-P6060-40W", "confidence": 0.97 },
        { "value": "40W flush-mount LED panel for commercial ceilings. 4000K neutral white, 4800 lumens.", "confidence": 0.91 },
        { "value": 34.50, "confidence": 0.94 },
        { "value": "Lighting", "confidence": 0.92 },
        { "value": 10, "confidence": 0.96 }
      ],
      [
        { "value": "Emergency Exit Sign LED", "confidence": 0.94 },
        { "value": "EX-LED-GN-01", "confidence": 0.96 },
        { "value": "Battery-backed LED exit sign with green pictogram. 3-hour emergency runtime.", "confidence": 0.89 },
        { "value": 22.80, "confidence": 0.93 },
        { "value": "Safety", "confidence": 0.90 },
        { "value": 5, "confidence": 0.95 }
      ]
    ],
    "confidence": 0.93
  },
  "supplier_name": {
    "type": "TEXT",
    "value": "Nordic Industrial Supply AB",
    "confidence": 0.97
  },
  "currency": {
    "type": "CURRENCY_CODE",
    "value": "EUR",
    "confidence": 0.95
  }
}

Each product row has individual confidence scores per field. High-confidence products go straight into your database. Low-confidence entries get flagged for review.
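Here is a minimal sketch of that routing step. It assumes the row shape shown in the sample response, and that cells arrive in the same order as the fields in item_schema. The 0.9 cutoff, the routeRows helper, and the type names are illustrative choices, not part of the API.

```typescript
// Shapes inferred from the sample response above; these are not official SDK types.
type Cell = { value: string | number | null; confidence: number };
type Row = Cell[];

// Assumption: cells arrive in the same order as the fields in item_schema.
const FIELD_NAMES = [
  "product_name",
  "sku",
  "description",
  "unit_price",
  "category",
  "minimum_order_quantity",
] as const;

type FieldName = (typeof FIELD_NAMES)[number];
type Product = Record<FieldName, Cell["value"]>;

// Hypothetical cutoff; tune it against your own review results.
const MIN_CONFIDENCE = 0.9;

function routeRows(rows: Row[], minConfidence: number = MIN_CONFIDENCE) {
  const accepted: Product[] = [];
  const needsReview: Product[] = [];

  for (const row of rows) {
    // Turn the positional row into an object keyed by field name.
    const product = {} as Product;
    FIELD_NAMES.forEach((name, index) => {
      product[name] = row[index]?.value ?? null;
    });

    // A row is only as trustworthy as its weakest field; empty rows always go to review.
    const lowest = row.length > 0 ? Math.min(...row.map((cell) => cell.confidence)) : 0;
    (lowest >= minConfidence ? accepted : needsReview).push(product);
  }

  return { accepted, needsReview };
}
```

With the two sample rows above and a 0.9 cutoff, the LED panel would be accepted, while the exit sign would be queued for review because its description scored 0.89.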

Normalizing Product Data Across Suppliers

Different suppliers describe the same product differently. Supplier A lists a cable as “CAT6 Ethernet Cable 3m Blue” while Supplier B lists the same cable as “Patch Cable UTP Cat.6 3 meters, blue”. Your product database needs one canonical representation.

Extraction gets you structured fields. Normalization still happens in your own code afterwards, but structured fields make it straightforward. You can match products across suppliers by SKU when available, or by comparing extracted attributes (category + key specs) when SKUs don’t align.
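One way to sketch that matching logic is shown below. The CatalogProduct shape, the helper names, and the 0.5 overlap threshold are all assumptions for illustration. Token overlap on product names stands in here for "comparing key specs"; a real catalog would likely need more careful rules.

```typescript
// Hypothetical shape for a product after extraction; field names follow the schema in this post.
interface CatalogProduct {
  supplier: string;
  product_name: string;
  sku?: string;
  category?: string;
}

// Uppercase and drop separators so "led-p6060-40w" and "LED P6060 40W" compare equal.
function normalizeSku(sku: string): string {
  return sku.toUpperCase().replace(/[^A-Z0-9]/g, "");
}

// Split a product name into comparable tokens: "Cat.6" -> "cat6", "3 meters" -> "3m".
function nameTokens(name: string): Set<string> {
  const cleaned = name
    .toLowerCase()
    .replace(/([a-z])\.(\d)/g, "$1$2")
    .replace(/(\d+)\s*(meters?|metres?|m)\b/g, "$1m")
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
  return new Set(cleaned.split(" ").filter((token) => token.length > 0));
}

// Jaccard similarity: shared tokens divided by all distinct tokens.
function tokenOverlap(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared += 1;
  });
  const total = a.size + b.size - shared;
  return total === 0 ? 0 : shared / total;
}

function isLikelySameProduct(a: CatalogProduct, b: CatalogProduct, minOverlap = 0.5): boolean {
  // Prefer an exact SKU match when both suppliers provide one.
  if (a.sku && b.sku && normalizeSku(a.sku) === normalizeSku(b.sku)) return true;

  // SKUs missing or not aligned: fall back to category plus name overlap.
  if (a.category && b.category && a.category.toLowerCase() !== b.category.toLowerCase()) {
    return false;
  }
  return tokenOverlap(nameTokens(a.product_name), nameTokens(b.product_name)) >= minOverlap;
}
```

For the two cable listings above, the names share the tokens cat6, cable, 3m, and blue out of seven distinct tokens, so they clear the 0.5 threshold even though no SKU is available.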

A common pattern is to extract a rich schema with optional fields and use whatever the catalog provides:

const productFields = [
  { name: "product_name", type: "TEXT", description: "Product name or title" },
  { name: "sku", type: "TEXT", description: "SKU, part number, or product code" },
  { name: "description", type: "TEXTAREA", description: "Product description" },
  { name: "unit_price", type: "CURRENCY_AMOUNT", description: "Price per unit" },
  { name: "category", type: "TEXT", description: "Product category or type" },
  { name: "brand", type: "TEXT", description: "Brand or manufacturer name" },
  { name: "weight", type: "TEXT", description: "Product weight with unit" },
  { name: "dimensions", type: "TEXT", description: "Product dimensions (L x W x H)" },
  { name: "material", type: "TEXT", description: "Primary material" },
  { name: "minimum_order_quantity", type: "INTEGER", description: "Minimum order quantity" },
];

Not every catalog includes weight, dimensions, or material. The parser returns what the document contains, and fields that aren’t present in the catalog are omitted from the response. Write your import logic to treat those fields as optional rather than requiring every supplier to provide identical data.
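As a rough sketch of import logic that tolerates missing fields, the mapper below assumes each extracted product has already been converted to an object keyed by field name. The ExtractedProduct and ProductRecord shapes are hypothetical, as is the rule that only a name and a price are required.

```typescript
// Hypothetical shapes: ExtractedProduct is one extracted row keyed by field name,
// ProductRecord is an illustrative database row.
interface ExtractedProduct {
  product_name?: string;
  sku?: string;
  unit_price?: number;
  brand?: string;
  weight?: string;
  dimensions?: string;
  material?: string;
}

interface ProductRecord {
  name: string;
  sku: string | null;
  unitPrice: number;
  brand: string | null;
  weight: string | null;
  dimensions: string | null;
  material: string | null;
}

// Assumption: a product is importable only with a name and a price; everything else is optional.
function toProductRecord(product: ExtractedProduct): ProductRecord | null {
  if (!product.product_name || product.unit_price === undefined) {
    return null; // the caller can log the row or queue it for review
  }
  return {
    name: product.product_name,
    sku: product.sku ?? null,
    unitPrice: product.unit_price,
    // Omitted fields become explicit nulls so every record has the same columns.
    brand: product.brand ?? null,
    weight: product.weight ?? null,
    dimensions: product.dimensions ?? null,
    material: product.material ?? null,
  };
}
```

Storing explicit nulls keeps every record the same shape, which tends to make later queries and cross-supplier comparisons simpler than dealing with absent keys.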

Batch Processing for Multi-Supplier Imports

Processing catalogs from multiple suppliers? Send up to 20 files in a single request. Each file gets extracted with the same schema, so you can batch catalogs from different suppliers and get consistent structured output for all of them.
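If you have more catalogs than fit in one request, you could split them into batches first. The sketch below takes the 20-file limit from the text above. chunkFiles and extractAll are hypothetical helpers, and extractBatch is a callback you would supply, for example one that wraps the client.extract call shown earlier.

```typescript
// File descriptor shape copied from the extract call earlier in this post.
type CatalogFile = { type: "url"; name: string; url: string };

// Limit stated above: up to 20 files per request.
const MAX_FILES_PER_REQUEST = 20;

// Split a list into consecutive batches of at most `size` items.
function chunkFiles<T>(files: T[], size: number = MAX_FILES_PER_REQUEST): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < files.length; start += size) {
    batches.push(files.slice(start, start + size));
  }
  return batches;
}

// extractBatch is a hypothetical callback, for example
// (files) => client.extract({ files, schema }) using the client shown earlier.
async function extractAll<R>(
  files: CatalogFile[],
  extractBatch: (batch: CatalogFile[]) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunkFiles(files)) {
    // Batches run one after another to keep the example simple.
    results.push(await extractBatch(batch));
  }
  return results;
}
```

Passing the extraction call in as a callback keeps the batching logic independent of the SDK, so it can be tested without an API key.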

Practical Pipeline

Receive supplier catalog (PDF, CSV, scan)
Send it to the Document Extraction API with your product schema
Route high-confidence products directly to your product database
Queue low-confidence products for human review
Optionally, pipe the product data into the Image Transformation API to resize and optimize supplier product photos for your marketplace

That last step is where composability pays off — the output of one Iteration Layer API becomes the input for another.

Get Started

Check the docs for the full ARRAY field reference and multi-file extraction documentation. The TypeScript and Python SDKs are available for server-side integration.

Sign up for a free account — no credit card required. Start with one supplier catalog and see how the schema handles your specific product data format.
