The Problem with AI Extraction
AI-powered document parsing is impressive in demos. Upload a PDF, get structured JSON, applaud. But demos don’t ship to production. Production needs to answer a different question: how do I know when the extraction is wrong?
Traditional OCR and extraction tools give you data without context. A regex parser returns a match or nothing. A template parser returns a value from a bounding box. Neither tells you whether the result is reliable.
When the extraction is wrong — and at scale, it will be wrong sometimes — you find out downstream. A wrong invoice total breaks a payment run. A wrong contract date triggers incorrect compliance alerts. A wrong address sends a shipment to the wrong city.
The Document Extraction API returns a confidence score between 0.0 and 1.0 for every extracted field. This changes the architecture of your pipeline from “trust everything” to “trust proportionally.”
What Confidence Scores Look Like
Every field in the response includes a confidence value:
{
  "invoiceNumber": {
    "type": "TEXT",
    "value": "INV-2026-4521",
    "confidence": 0.97
  },
  "vendorName": {
    "type": "TEXT",
    "value": "Meridian Supply Co.",
    "confidence": 0.94
  },
  "totalAmount": {
    "type": "CURRENCY_AMOUNT",
    "value": 3847.50,
    "confidence": 0.96
  },
  "shippingAddress": {
    "type": "ADDRESS",
    "value": {
      "street": "42 Innovation Drive",
      "city": "Austin",
      "region": "TX",
      "postal_code": "78701",
      "country": "US"
    },
    "confidence": 0.82
  }
}
The invoice number and total have high confidence — the parser is very sure about these values. The shipping address is lower — maybe the scan quality was poor in that area, or the formatting was ambiguous. Your code can treat these differently.
Building a Threshold-Based Pipeline
The most common pattern is three tiers:
// Minimal shape of one extracted field in the API response
interface FieldResult {
  type: string;
  value: unknown;
  confidence: number;
}

const HIGH_CONFIDENCE_THRESHOLD = 0.92;
const LOW_CONFIDENCE_THRESHOLD = 0.70;

const processField = (fieldName: string, fieldResult: FieldResult) => {
  if (fieldResult.confidence >= HIGH_CONFIDENCE_THRESHOLD) {
    // Auto-accept: write directly to database
    return { action: "accept", field: fieldName, value: fieldResult.value };
  }
  if (fieldResult.confidence >= LOW_CONFIDENCE_THRESHOLD) {
    // Review: pre-fill the form, ask a human to confirm
    return { action: "review", field: fieldName, value: fieldResult.value };
  }
  // Reject: require manual entry
  return { action: "manual", field: fieldName, value: null };
};
- Above 0.92 — auto-accept. The extraction is reliable enough to write directly to your database.
- Between 0.70 and 0.92 — flag for review. Pre-fill the value in a review UI so the human only needs to confirm or correct.
- Below 0.70 — require manual entry. The parser couldn’t extract the field reliably.
The thresholds depend on your domain. Financial data might warrant a higher bar (0.95). A content aggregation pipeline might accept 0.80. The point is that you have the data to make that decision per field, per document.
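Run against the sample response above, the three tiers split the document field by field. A self-contained sketch (the `route` helper mirrors the thresholds, and the confidence values come from the earlier example):

```typescript
// Route each field of the sample invoice through the three tiers.
const route = (confidence: number): "accept" | "review" | "manual" =>
  confidence >= 0.92 ? "accept" : confidence >= 0.70 ? "review" : "manual";

// Confidence scores from the example response
const sampleConfidences: Record<string, number> = {
  invoiceNumber: 0.97,
  vendorName: 0.94,
  totalAmount: 0.96,
  shippingAddress: 0.82,
};

const routed = Object.fromEntries(
  Object.entries(sampleConfidences).map(([name, confidence]) => [
    name,
    route(confidence),
  ])
);
// Three fields are auto-accepted; only shippingAddress (0.82) lands
// in the review tier, so a human confirms just that one value.
```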
Per-Field, Not Per-Document
Confidence scores are per field, not per document. A single document might have five fields at 0.95+ and one field at 0.72. You auto-accept the five high-confidence fields and only route the one uncertain field for review.
This is dramatically more efficient than reviewing entire documents. A human reviewer sees one pre-filled field that needs confirmation instead of re-checking the whole document.
Practical Confidence Patterns
Invoice processing. Auto-accept invoice number, date, and vendor name (usually high confidence). Flag line item totals for review when they fall below your threshold. Use CALCULATED fields to cross-check: if the computed subtotal doesn’t match the extracted subtotal, both get flagged.
Resume screening. Auto-accept name and email (high confidence). Flag skills and experience summaries (often medium confidence due to varied formatting). Require manual review for contact details extracted from scanned documents.
Contract analysis. Auto-accept party names and effective dates. Flag clause summaries (TEXTAREA fields have more room for partial extraction). Flag boolean fields like “has non-compete” when confidence is below 0.85 — the business impact of a wrong answer is high.
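The cross-check from the invoice pattern can be sketched like this. The field names and the one-cent tolerance are illustrative assumptions, not part of the API:

```typescript
// Recompute the subtotal from extracted line items and compare it to
// the extracted subtotal. A mismatch flags both for review regardless
// of their individual confidence scores.
interface ExtractedLineItem {
  description: string;
  amount: number;
}

const subtotalMatches = (
  lineItems: ExtractedLineItem[],
  extractedSubtotal: number,
  toleranceCents = 1
): boolean => {
  // Compare in integer cents to avoid floating-point drift
  const computedCents = Math.round(
    lineItems.reduce((sum, item) => sum + item.amount, 0) * 100
  );
  const extractedCents = Math.round(extractedSubtotal * 100);
  return Math.abs(computedCents - extractedCents) <= toleranceCents;
};

const consistent = subtotalMatches(
  [
    { description: "Industrial widgets", amount: 2500.0 },
    { description: "Freight", amount: 1347.5 },
  ],
  3847.5
);
// consistent is true: the computed and extracted subtotals agree
```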
Monitoring Confidence Over Time
Track your confidence distributions. If average confidence drops, something changed — maybe a vendor updated their document format, or scan quality degraded. If average confidence is consistently above 0.95, you might be able to tighten your auto-accept threshold and reduce the review queue.
// Log confidence distributions for monitoring
const confidenceValues = Object.entries(data).map(([fieldName, result]) => ({
  field: fieldName,
  confidence: result.confidence,
  action:
    result.confidence >= 0.92 ? "accepted"
    : result.confidence >= 0.70 ? "review"
    : "manual",
}));
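One way to turn that log into a trend signal is to average confidence per field across a batch of documents. A sketch; the batch shape here is an assumption, not the API's response format:

```typescript
// Average confidence per field across a batch of extraction results.
// A drop for a single field often means one vendor changed a layout.
type ExtractionBatch = Array<Record<string, { confidence: number }>>;

const averageConfidenceByField = (
  batch: ExtractionBatch
): Record<string, number> => {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const doc of batch) {
    for (const [field, result] of Object.entries(doc)) {
      sums[field] ??= { total: 0, count: 0 };
      sums[field].total += result.confidence;
      sums[field].count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([field, s]) => [field, s.total / s.count])
  );
};

const averages = averageConfidenceByField([
  { totalAmount: { confidence: 0.96 }, vendorName: { confidence: 0.94 } },
  { totalAmount: { confidence: 0.90 }, vendorName: { confidence: 0.92 } },
]);
// averages.totalAmount is roughly 0.93; alert if it trends downward
```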
Per-Field Threshold Strategies
Not all fields deserve the same threshold. A wrong invoice number is annoying but correctable. A wrong payment amount triggers a wrong payment. Set thresholds based on the cost of getting it wrong:
const THRESHOLD_BY_FIELD: Record<string, number> = {
  invoiceNumber: 0.90,
  vendorName: 0.88,
  invoiceDate: 0.90,
  lineItems: 0.85,
  subtotal: 0.95,
  taxAmount: 0.95,
  totalDue: 0.95,
  currency: 0.90,
};

const processFields = (data: Record<string, FieldResult>) =>
  Object.entries(data).map(([fieldName, result]) => {
    const threshold = THRESHOLD_BY_FIELD[fieldName] ?? 0.90;
    if (result.confidence >= threshold) {
      return { action: "accept", field: fieldName, value: result.value };
    }
    return { action: "review", field: fieldName, value: result.value };
  });
Financial fields (subtotal, tax, total) get a higher bar. Descriptive fields (vendor name, line item descriptions) get a lower one. This reflects the real-world cost of errors — a wrong vendor name is a minor annoyance, a wrong total is a financial discrepancy.
Building a Review Queue
The review tier is where confidence scores pay off most. Instead of dumping uncertain documents into a general inbox, build a targeted review queue that shows reviewers exactly which fields need attention.
A review queue entry should include: the original document (as a link or thumbnail), the extracted values for all fields, the confidence score next to each value, and a visual indicator (green/yellow/red) based on your thresholds. The reviewer confirms or corrects only the flagged fields and approves the rest in one click.
This is dramatically faster than full manual review. A reviewer who needs to re-examine one field out of eight spends a fraction of the time compared to reviewing the entire document from scratch.
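One plausible shape for a queue entry. The field names and the traffic-light mapping are assumptions for illustration, not part of the API:

```typescript
type ReviewStatus = "green" | "yellow" | "red";

interface ReviewQueueEntry {
  documentUrl: string;              // link or thumbnail of the original
  fields: Array<{
    name: string;
    value: unknown;
    confidence: number;
    status: ReviewStatus;           // visual indicator for the reviewer
    needsAttention: boolean;        // only these require confirmation
  }>;
}

const statusFor = (confidence: number): ReviewStatus =>
  confidence >= 0.92 ? "green" : confidence >= 0.70 ? "yellow" : "red";

const toQueueEntry = (
  documentUrl: string,
  extracted: Record<string, { value: unknown; confidence: number }>
): ReviewQueueEntry => ({
  documentUrl,
  fields: Object.entries(extracted).map(([name, field]) => ({
    name,
    value: field.value,
    confidence: field.confidence,
    status: statusFor(field.confidence),
    needsAttention: statusFor(field.confidence) !== "green",
  })),
});
```

A review UI can then render green fields read-only and surface only the yellow and red ones for confirmation.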
Adjusting Thresholds Over Time
Start with conservative thresholds: a high auto-accept bar and a generous review window. Track what reviewers actually change. If 98% of fields in the review tier get confirmed without changes, your auto-accept threshold is too strict. Lower it and shrink the review queue.
If reviewers are frequently correcting fields that were auto-accepted, your threshold is too lax. Raise it to catch more questionable extractions before they hit your database.
The goal is to find the point where auto-accepted fields are correct often enough to be trusted, and the review queue is small enough that reviewers handle it without being overwhelmed.
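That feedback loop reduces to two rates. A sketch: the 98% confirmation figure follows the text, while the 1% error budget for auto-accepted fields is an assumption you would tune:

```typescript
// Decide whether to move the auto-accept threshold based on what
// reviewers actually did with the last batch of fields.
const suggestThresholdChange = (
  reviewConfirmRate: number, // share of review-tier fields confirmed unchanged
  acceptErrorRate: number    // share of auto-accepted fields later corrected
): "lower" | "raise" | "keep" => {
  if (acceptErrorRate > 0.01) return "raise";    // too many bad auto-accepts
  if (reviewConfirmRate >= 0.98) return "lower"; // review tier almost always right
  return "keep";
};
```

Checking the error rate first reflects the asymmetry: a bad auto-accept reaches your database, while an unnecessary review only costs reviewer time.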
The Business Case
Manual document processing costs time. Full automation without verification costs trust. Confidence-based automation gives you both — speed for the clear cases, human oversight for the ambiguous ones.
For a team processing 500 invoices per month: if 85% of fields are auto-accepted, 12% are pre-filled for quick review, and 3% require manual entry — that’s a fraction of the manual effort with higher accuracy than either full-manual or full-auto approaches.
Get Started
Check the docs to see how confidence scores work across all 17 field types. The TypeScript and Python SDKs return typed response objects with confidence scores on every field.
Sign up for a free account — no credit card required. Run a few documents through with your schema and check the confidence distributions before building your thresholding logic.