Mixed Documents Need Mixed Representations
Many document workflows start with a false simplification: this upload is a PDF, so it needs one PDF extraction strategy.
Then the file arrives.
The first two pages are a structured form. The next five pages are invoices with tables. Then there is a narrative explanation, a signed approval page, a few photos, and a contract excerpt with dense paragraphs. The user thinks of it as one submission. The storage layer thinks of it as one file. But the content inside it is not one thing.
If every page is treated the same, the workflow loses meaning.
Forms, tables, and free text carry information differently. A form asks for named fields. A table repeats rows. A narrative section preserves context through paragraphs, headings, and argument structure. Forcing all three into the same representation creates awkward output: prose squeezed into JSON fields, tables flattened into unreliable text, checkboxes hidden in Markdown, or form fields buried in a blob that a downstream system has to parse again.
The better question is not “How do we extract this document?” It is “What representation does each part of this document need?”
That starts with recognizing why messy forms need different handling from tables and narrative sections.
Forms Want Fields Because Decisions Need Names
Forms are built around named values.
An application usually needs specific facts: applicant name, date of birth, consent status, requested amount, policy number, member ID, signature date, tax status, or selected benefit. Those facts often drive workflow decisions. Create the case. Route for review. Confirm eligibility. Generate a response. Block the next step until consent is present.
For that kind of content, typed fields are the right shape. The field name should match the destination system, not necessarily the label on the page. A checkbox should become a business decision such as has_signed_consent, not a vague mark. A date should say which date it is. A missing value should remain missing, not become an empty string that looks intentional.
The hard part of form extraction is not only reading boxes. It is deciding whether a field can safely drive the next step. A low-confidence optional note may be acceptable. A low-confidence payment authorization should stop the workflow. A form section that is mostly clear may create a partial record while one field waits for review.
That makes form extraction a state problem as much as a parsing problem. The output should preserve confidence, citations, review status, and approved values. Otherwise the workflow has no way to distinguish “the user did not provide this” from “the extractor was unsure” from “a reviewer corrected this later.”
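As a minimal sketch of that state problem, a field record can carry value, confidence, citation, and review status together, so the workflow can tell "not provided" from "unsure" from "reviewer approved." The names (`FormField`, `can_drive_workflow`, the 0.9 threshold) are illustrative assumptions, not a prescribed API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ReviewStatus(Enum):
    NOT_PROVIDED = "not_provided"   # the user left the field blank
    NEEDS_REVIEW = "needs_review"   # the extractor was unsure
    APPROVED = "approved"           # a reviewer accepted or corrected the value

@dataclass
class FormField:
    name: str                 # destination-system name, e.g. has_signed_consent
    value: Optional[str]      # a missing value stays None, never ""
    confidence: float         # extractor confidence, 0.0 to 1.0
    citation: Optional[str]   # page/region the value was read from
    status: ReviewStatus

def can_drive_workflow(field: FormField, threshold: float = 0.9) -> bool:
    """A field may advance the workflow only if a reviewer approved it,
    or it was extracted with a value above the confidence threshold."""
    if field.status is ReviewStatus.APPROVED:
        return True
    return field.value is not None and field.confidence >= threshold

# A low-confidence consent checkbox blocks the next step until review.
consent = FormField("has_signed_consent", "true", 0.62,
                    "page 1, box 4", ReviewStatus.NEEDS_REVIEW)
```

The same record shape lets an optional note pass at low confidence while a payment authorization with the same confidence halts: the policy lives in the threshold per field, not in the parser.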
Tables Want Arrays Because Rows Repeat Meaning
Tables are different. They are built around repeated records.
An invoice contains line items. A bank statement contains transactions. A supplier catalog contains SKUs. A compliance report contains findings. A budget packet contains variance rows. The workflow usually needs a list of business rows, not a visual cell grid.
This distinction matters because the visible table is often not the same as the operational row. Section headings may apply to several rows. Totals may look like rows but should be extracted separately. A description may wrap across two visual lines. A note outside the table may define the currency for all amounts. A subtotal may be useful for reconciliation but dangerous if imported as a transaction.
Arrays let the workflow model repeated records directly. Each item can have typed fields, confidence, source citation, and review state. Summary values can live outside the array. Rows that need review can be separated from rows that are ready to import.
This is especially important when the output is a spreadsheet. A clean workbook should be generated from approved rows, not from a raw grid that still needs interpretation. The import tab can contain the accepted records. A review tab can preserve uncertainty. A summary tab can show reconciliation totals.
The hard part is not preserving every cell. The hard part is defining what one row means.
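A sketch of that row-first view, under assumed names: line items are typed records, the printed subtotal is extracted separately and used for reconciliation instead of being imported as a row, and low-confidence rows are split into a review queue rather than mixed into the import set:

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float   # currency is defined once, outside the array
    confidence: float
    citation: str       # where in the source the row was read

def split_for_import(rows, threshold=0.9):
    """Partition extracted rows into import-ready and needs-review queues."""
    ready = [r for r in rows if r.confidence >= threshold]
    review = [r for r in rows if r.confidence < threshold]
    return ready, review

rows = [
    LineItem("Widget A", 10, 4.50, 0.98, "page 3, row 2"),
    LineItem("Widget B (description wrapped)", 2, 19.00, 0.71, "page 3, rows 3-4"),
]
ready, review = split_for_import(rows)

# Reconciliation: the stated subtotal checks the rows; it is never a row itself.
stated_subtotal = 83.00
computed = sum(r.quantity * r.unit_price for r in rows)
```

The import tab of a generated workbook would come from `ready`, the review tab from `review`, and the summary tab from the reconciliation check.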
Free Text Wants Markdown Because Context Matters
Narrative content does not always want to become JSON.
A policy explanation, contract clause, medical narrative, inspection report, customer statement, or legal analysis may carry meaning through structure. Headings matter. Lists matter. Paragraph order matters. A sentence may only make sense because of the section above it. Turning that into a handful of fields can throw away the very context the next reader or model needs.
Markdown is often a better representation for narrative sections. It preserves readable structure without pretending every fact belongs in a fixed schema. It is useful for search, summarization, RAG, agent context, review, and human handoff.
That does not mean free text should never be structured. Sometimes a contract clause needs a field such as renewal term or governing law. Sometimes a medical note needs diagnosis, medication, or follow-up date. But those fields should be chosen because the workflow needs them, not because every paragraph must become JSON.
There is a tradeoff. Markdown preserves context but does not enforce a database contract. Typed extraction creates a contract but can flatten nuance. Mixed documents often need both: fields for decisions, arrays for repeated records, and Markdown for context the system should preserve rather than over-interpret.
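One way to hold both sides of that tradeoff is to keep the Markdown body intact and attach only the decision fields the workflow actually routes on. The section shape and field names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class NarrativeSection:
    """A contract or report section: Markdown preserves context,
    and decision_fields carries only what the workflow decides on."""
    markdown: str                        # headings, lists, paragraph order intact
    decision_fields: dict = field(default_factory=dict)

clause = NarrativeSection(
    markdown=(
        "## Renewal\n\n"
        "This agreement renews automatically for successive one-year terms "
        "unless either party gives 60 days' written notice.\n"
    ),
    # Extracted because the workflow needs them, not because every
    # paragraph must become JSON.
    decision_fields={"renewal_term_months": 12, "notice_period_days": 60},
)
```

Search, summarization, and review read `markdown`; routing and validation read `decision_fields`. Neither representation has to stand in for the other.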
One Upload Can Need All Three
Consider a supplier onboarding packet.
The first page is a form with supplier identity, tax status, and consent. The next pages contain a product catalog table with SKUs, prices, and minimum order quantities. The contract includes paragraphs about renewal, termination, liability, and jurisdiction. At the end, there is a signed approval page.
Treating that packet as one extraction target creates a mess. A giant schema has to carry form fields, table rows, contract clauses, signatures, summaries, and review metadata. The result may look complete, but it becomes hard to validate and harder to maintain.
A better design uses different representations for different evidence:
Form-like pages become typed fields for supplier identity and consent. Table-heavy pages become arrays of catalog rows or finance rows. Narrative contract sections become Markdown for review and agent context, with a few structured fields extracted only when they drive workflow decisions. Approved data can then feed generated outputs: an onboarding summary PDF, a finance workbook, or an internal checklist.
The packet remains one case. The representations are different because the evidence is different.
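A case record for that packet might hold all three representations side by side rather than forcing them into one schema. Every identifier and value here is a made-up illustration of the shape, not a required format:

```python
# One case, three representations: fields for decisions, arrays for
# repeated records, Markdown for narrative context.
case = {
    "case_id": "supplier-0042",
    "form_fields": {                 # typed fields from form-like pages
        "supplier_name": "Acme GmbH",
        "tax_status": "registered",
        "has_signed_consent": True,
    },
    "catalog_rows": [                # arrays from table-heavy pages
        {"sku": "A-100", "price": 4.50, "min_order_qty": 10},
        {"sku": "A-200", "price": 19.00, "min_order_qty": 2},
    ],
    "contract_markdown": "## Termination\n\nEither party may terminate on notice.",
    "approval_evidence": {"page": 12, "signed": True},
}

# Generated outputs are built from approved case data, not re-parsed text.
summary_inputs = {
    "supplier": case["form_fields"]["supplier_name"],
    "sku_count": len(case["catalog_rows"]),
}
```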
This is where Iteration Layer fits. Document Extraction is schema-based, so one schema can ask for named form fields and another can ask for arrays of table rows. You do not have to flatten every page into text and then write a second parser. The schema describes the shape the workflow needs.
Document to Markdown preserves readable structure for narrative content. Sheet Generation and Document Generation can produce controlled outputs from approved workflow data. Your application still owns routing, review policy, validation, and storage.
The benefit is not that one API call magically understands every business process. The benefit is that the workflow can choose the right representation for each part without forcing everything through one shape.
Intake Should Route Before Processing When Possible
Mixed documents become easier when the intake layer does some classification before extraction.
If the application already knows a file is a form, route it to field extraction. If it knows a section contains line items, model rows. If the document is mostly narrative, convert it to Markdown. If a file is unsupported or clearly wrong, reject it before it contaminates the case.
The classification does not have to be perfect. It just needs to reduce obvious mismatches. A table-heavy invoice should not be treated like prose. A contract should not be forced into a table model. A checkbox page should not be stored only as Markdown if the checkbox controls consent.
There are cases where intake cannot know enough. A concatenated packet may need a decomposition step before routing. A low-quality scan may need human review before any extraction is useful. A document may contain a table embedded in a narrative section, and both representations may be needed.
The point is to make routing explicit instead of accidental. Every processing step should have a reason tied to the content shape and the workflow decision it supports.
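Explicit routing can be as small as a function that maps a classified section to a representation, with unknowns falling through to decomposition or human review instead of a silent default. The classification labels and route names are assumptions for illustration:

```python
def route(section: dict) -> str:
    """Map a classified section to an explicit processing route.
    Anything intake cannot confidently classify goes to decomposition
    or review rather than being forced through a default extractor."""
    kind = section.get("kind")
    if kind == "form":
        return "field_extraction"      # named fields drive decisions
    if kind == "table":
        return "row_extraction"        # arrays of business rows
    if kind == "narrative":
        return "markdown_conversion"   # preserve context and structure
    if kind == "unsupported":
        return "reject"                # keep it out of the case
    # Concatenated packets, low-quality scans, embedded tables in prose.
    return "needs_decomposition"
```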
Do Not Let One Representation Become a Dumping Ground
Mixed-document systems often fail by overusing whichever representation worked first.
If the first successful prototype used JSON fields, the team keeps adding fields until narrative context becomes awkwardly chopped into fragments. If the first prototype used Markdown, the team leaves business-critical values buried in text and writes a second parser later. If the first prototype used table extraction, the team tries to force form sections into row-like structures.
Each shortcut creates downstream work.
Fields are good when the destination needs named values. Arrays are good when the destination needs repeated records. Markdown is good when the destination needs readable context. Generated PDFs and spreadsheets are good when approved data needs to be delivered in a human-friendly artifact.
The representations can coexist. In fact, they usually should. What matters is that each one has a clear purpose.
Where a Single Strategy Still Works
Not every document needs mixed extraction.
If you have one clean form and only need named fields, use field extraction. If you have one spreadsheet-like table and only need rows, model arrays. If the document is a report for search or summarization, Markdown may be enough. Adding multiple representations when one is sufficient creates complexity without payoff.
The mixed approach matters when a workflow has multiple content shapes and multiple downstream needs. Claims packets, supplier onboarding, lending packets, audits, compliance reviews, insurance files, and case handoffs often fall into that category.
The tradeoff is coordination. Your application needs to track which representation came from which source, which fields were approved, and which generated outputs depend on which data. That state is work. But it is usually less work than pretending every page can be safely forced into one output format.
Start by Marking the Document
Take one mixed document workflow and mark each section before building the pipeline.
This section is a form. This section is a table. This section is narrative. This page is approval evidence. This attachment is unsupported. Then decide the representation for each part: fields, arrays, Markdown, generated output, or rejection.
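Written down as data, the marking exercise might look like the sketch below, with hypothetical page ranges and labels. The check at the end is the point of the exercise: every section gets a deliberate representation, and nothing falls through unhandled:

```python
# Mark each section's content shape and chosen representation
# before building the pipeline. All values here are illustrative.
plan = [
    {"pages": "1-2",  "shape": "form",      "representation": "fields"},
    {"pages": "3-7",  "shape": "table",     "representation": "arrays"},
    {"pages": "8-10", "shape": "narrative", "representation": "markdown"},
    {"pages": "11",   "shape": "approval",  "representation": "fields"},
    {"pages": "12",   "shape": "photo",     "representation": "reject"},
]

ALLOWED = {"fields", "arrays", "markdown", "generated", "reject"}
unhandled = [s for s in plan if s["representation"] not in ALLOWED]
```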
That exercise usually exposes the real architecture. The problem was never just extracting a PDF. It was preserving the right kind of meaning for each part of the workflow.