Build a Resume Parser That Actually Works — From PDF to Structured Profile


Resume Parsing Is a Solved Problem (It Isn’t)

Every ATS and recruiting platform needs to parse resumes. It sounds simple — names, emails, job titles, companies, dates. But resumes are one of the least standardized document types on earth.

One candidate uses a two-column layout. Another uses a single column with horizontal rules. A third exports from LinkedIn as a PDF with embedded images and non-selectable text. A fourth uploads a DOCX with custom formatting. A fifth sends a scan of a printed resume.

Template-based resume parsers break on layout variation. Regex parsers break on format variation. Both produce garbage when they encounter a resume style they haven’t seen before — and they do this silently.

The Document Extraction API uses schema-based extraction. You define the fields you want — name, contact info, experience, skills, education — and the parser finds them in any resume format.

The Resume Schema

import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

const { data } = await client.extract({
  files: [
    { type: "base64", name: "resume.pdf", base64: resumeBase64 }
  ],
  schema: {
    fields: [
      {
        name: "full_name",
        type: "TEXT",
        description: "Candidate's full name",
        is_required: true,
      },
      {
        name: "email",
        type: "EMAIL",
        description: "Candidate's email address",
      },
      {
        name: "phone",
        type: "TEXT",
        description: "Candidate's phone number",
      },
      {
        name: "location",
        type: "ADDRESS",
        description: "Candidate's location or address",
      },
      {
        name: "summary",
        type: "TEXTAREA",
        description: "Professional summary or objective statement",
      },
      {
        name: "experience",
        type: "ARRAY",
        description: "Work experience entries",
        item_schema: {
          fields: [
            { name: "company", type: "TEXT", description: "Company or organization name" },
            { name: "title", type: "TEXT", description: "Job title or role" },
            { name: "start_date", type: "DATE", description: "Start date of employment" },
            { name: "end_date", type: "DATE", description: "End date (or current)" },
            { name: "description", type: "TEXTAREA", description: "Role description and achievements" },
          ],
        },
      },
      {
        name: "education",
        type: "ARRAY",
        description: "Education history",
        item_schema: {
          fields: [
            { name: "institution", type: "TEXT", description: "School or university name" },
            { name: "degree", type: "TEXT", description: "Degree type and field of study" },
            { name: "graduation_date", type: "DATE", description: "Graduation or completion date" },
          ],
        },
      },
      {
        name: "skills",
        type: "ARRAY",
        description: "Technical and professional skills",
        item_schema: {
          fields: [
            { name: "skill_name", type: "TEXT", description: "Skill name" },
          ],
        },
      },
      {
        name: "languages",
        type: "ARRAY",
        description: "Languages spoken",
        item_schema: {
          fields: [
            { name: "language", type: "TEXT", description: "Language name" },
            { name: "proficiency", type: "ENUM", description: "Proficiency level", values: ["native", "fluent", "advanced", "intermediate", "basic"] },
          ],
        },
      },
    ],
  },
});
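
The snippet above assumes `resumeBase64` already holds the encoded file. In Node, a small helper can produce it; this is a sketch (the `encodeResume` name is mine, not part of the SDK):

```typescript
import { readFileSync } from "node:fs";

// Read a resume from disk and base64-encode it for the `files` array.
function encodeResume(path: string): string {
  return readFileSync(path).toString("base64");
}

// const resumeBase64 = encodeResume("./resume.pdf");
```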

Structured Output

{
  "fullName": {
    "type": "TEXT",
    "value": "Sarah Chen",
    "confidence": 0.98
  },
  "email": {
    "type": "EMAIL",
    "value": "sarah.chen@example.com",
    "confidence": 0.97
  },
  "location": {
    "type": "ADDRESS",
    "value": {
      "city": "San Francisco",
      "region": "CA",
      "country": "US"
    },
    "confidence": 0.91
  },
  "experience": {
    "type": "ARRAY",
    "value": [
      [
        { "value": "Stripe", "confidence": 0.96 },
        { "value": "Senior Software Engineer", "confidence": 0.95 },
        { "value": "2023-06", "confidence": 0.92 },
        { "value": "2026-01", "confidence": 0.90 },
        { "value": "Led payments infrastructure team. Reduced payment processing latency by 40%.", "confidence": 0.88 }
      ],
      [
        { "value": "Shopify", "confidence": 0.95 },
        { "value": "Software Engineer", "confidence": 0.94 },
        { "value": "2020-08", "confidence": 0.91 },
        { "value": "2023-05", "confidence": 0.89 },
        { "value": "Built merchant onboarding APIs serving 2M+ merchants.", "confidence": 0.86 }
      ]
    ],
    "confidence": 0.92
  },
  "skills": {
    "type": "ARRAY",
    "value": [
      [{ "value": "TypeScript", "confidence": 0.96 }],
      [{ "value": "Go", "confidence": 0.95 }],
      [{ "value": "PostgreSQL", "confidence": 0.94 }],
      [{ "value": "Kubernetes", "confidence": 0.93 }],
      [{ "value": "gRPC", "confidence": 0.92 }]
    ],
    "confidence": 0.94
  }
}

Why Schema-Based Parsing Works for Resumes

Resumes are creative documents. Candidates choose their own layout, their own section headings (“Experience” vs “Work History” vs “Professional Background”), their own way of listing dates (“Jan 2023 - Present” vs “2023.01 - current” vs “Since January 2023”).

The parser doesn’t depend on headings or layout. It understands the content and maps it to your schema. “Professional Background” and “Work Experience” both get extracted into your experience ARRAY field.

The EMAIL field type validates email addresses. The ADDRESS field decomposes locations into components. The DATE field normalizes date formats. You get clean, structured data regardless of how the candidate formatted their resume.
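
Because DATE fields come back normalized (the sample output above uses "YYYY-MM" strings), downstream date math becomes trivial. A minimal sketch, assuming that format:

```typescript
// Months between two normalized "YYYY-MM" date strings.
function monthsBetween(start: string, end: string): number {
  const [startYear, startMonth] = start.split("-").map(Number);
  const [endYear, endMonth] = end.split("-").map(Number);
  return (endYear - startYear) * 12 + (endMonth - startMonth);
}

// "2023-06" to "2026-01" is 31 months of tenure.
const tenureMonths = monthsBetween("2023-06", "2026-01");
```

None of this arithmetic is possible when dates arrive as "Since January 2023" free text, which is the point of normalization.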

Handling Different Resume Formats

Resumes arrive in every format your file upload widget accepts — and some it probably shouldn’t.

PDF from Word. The most common format. Text is selectable, layout is preserved. These parse cleanly because the text layer is already there.

PDF from design tools. Candidates who use Canva, Figma, or InDesign create PDFs with complex layouts — multi-column, custom fonts, icons for contact info instead of text. The parser handles these because it reads semantic content, not layout positions. A phone icon followed by a number is still a phone number.

DOCX files. Word documents with tables for layout, custom styles, and sometimes embedded images for headings or section dividers. The parser extracts text from DOCX natively without needing a Word installation.

Scanned resumes. A printed resume that was scanned or photographed. The built-in OCR reads the text before extraction begins. Scan quality matters — a clear scan at 300 DPI produces better results than a phone photo taken at an angle. But even lower-quality scans typically extract names, emails, and company names reliably.

LinkedIn PDF exports. LinkedIn’s “Save to PDF” feature produces a specific layout that many parsers have hardcoded templates for. The schema-based approach doesn’t need a LinkedIn-specific template — it extracts the same fields from a LinkedIn export as from any other resume format.

Extending the Schema for Your Use Case

The base schema covers the universal resume fields. Depending on your industry, you might need additional fields.

For technical roles, add certifications and portfolio links:

{
  name: "certifications",
  type: "ARRAY",
  description: "Professional certifications",
  item_schema: {
    fields: [
      { name: "certName", type: "TEXT", description: "Certification name" },
      { name: "issuer", type: "TEXT", description: "Issuing organization" },
      { name: "dateObtained", type: "DATE", description: "Date obtained" },
    ],
  },
},
{
  name: "portfolioUrl",
  type: "TEXT",
  description: "Portfolio, GitHub, or personal website URL",
},

For academic positions, add publications:

{
  name: "publications",
  type: "ARRAY",
  description: "Academic publications",
  item_schema: {
    fields: [
      { name: "publicationTitle", type: "TEXT", description: "Title of the publication" },
      { name: "journal", type: "TEXT", description: "Journal or conference name" },
      { name: "publicationYear", type: "INTEGER", description: "Year published" },
    ],
  },
},

The schema adapts to what you need. A recruiting agency parsing resumes for nursing positions doesn’t need a publications field. A university hiring committee doesn’t need a certifications field. Define the fields relevant to your pipeline and skip the rest.

Batch Processing for Recruiting

A job posting gets 200 applications. Send the resumes in batches of 20 per API call — 10 batch requests process all 200 resumes. The same schema applies to every resume, producing consistent structured data for comparison and screening.
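
A minimal batching sketch (the `chunk` helper is mine, not part of the SDK; the per-batch call mirrors the `client.extract` example above):

```typescript
// Split an array into batches of a fixed size.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 200 resumes in batches of 20 → 10 extract calls, all with the same schema.
// for (const batch of chunk(resumeFiles, 20)) {
//   await client.extract({ files: batch, schema: resumeSchema });
// }
```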

Confidence-Based Screening

Use confidence scores to build your intake pipeline:

  • High confidence (above 0.90) — auto-populate the candidate profile in your ATS
  • Medium confidence (0.75 to 0.90) — pre-fill and flag for recruiter review
  • Low confidence (below 0.75) — require manual entry

For skills and experience descriptions, lower confidence is expected — these sections have more variation. Name and email are typically high confidence.
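
The thresholds above reduce to a small routing function. A sketch (the action names are illustrative, not API values):

```typescript
type IntakeAction = "auto_populate" | "recruiter_review" | "manual_entry";

// Route a field to an intake action based on its extraction confidence.
function routeByConfidence(confidence: number): IntakeAction {
  if (confidence > 0.9) return "auto_populate";
  if (confidence >= 0.75) return "recruiter_review";
  return "manual_entry";
}
```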

Building a Scoring Pipeline

Once you have structured resume data, you can build scoring logic on top of it. The API gives you the raw structured data — scoring and ranking are your business logic.

A simple approach: define your requirements as a checklist and score each resume against it.

const requiredSkills = ["TypeScript", "PostgreSQL", "Docker"];
const preferredSkills = ["Kubernetes", "gRPC", "Terraform"];

// Each skills row is an array of field objects; the first entry is skill_name.
const extractedSkills = data.skills.value.map((row) => row[0].value.toLowerCase());

const requiredSkillMatches = requiredSkills.filter((skill) =>
  extractedSkills.includes(skill.toLowerCase())
);

const preferredSkillMatches = preferredSkills.filter((skill) =>
  extractedSkills.includes(skill.toLowerCase())
);

// Weight required skills more heavily than preferred ones.
const score = requiredSkillMatches.length * 2 + preferredSkillMatches.length;

This is basic string matching. For production use, you’d want fuzzy matching — “TypeScript” should match “Typescript”, “TS”, and “typescript/javascript”. But the structured extraction gives you a clean starting point. You’re matching against a flat array of skill names, not trying to find skills in a blob of unstructured text.
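
A lightweight first step toward fuzzy matching is alias normalization before comparing. A sketch (the alias table is illustrative; a production system might add edit-distance or embedding similarity on top):

```typescript
// Map common variants to a canonical skill name before comparing.
const SKILL_ALIASES: Record<string, string> = {
  ts: "typescript",
  golang: "go",
  postgres: "postgresql",
  k8s: "kubernetes",
};

function normalizeSkill(raw: string): string {
  const lowered = raw.trim().toLowerCase();
  return SKILL_ALIASES[lowered] ?? lowered;
}

// normalizeSkill("TS") and normalizeSkill("Typescript") both yield "typescript".
```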

What’s Next

Parsed candidate profiles feed directly into Image Generation for candidate summary cards — same auth, same credit pool.

Get Started

Check the docs for the full API reference and ARRAY field documentation. The TypeScript and Python SDKs handle parsing and typed responses.

Sign up for a free account — no credit card required. Parse a few resumes from your pipeline to see how the schema handles your specific candidate documents.
