Why most IDP projects fail before they start
The problem is almost never the OCR.
Every intelligent document processing project starts with the same shopping list. Find a good OCR engine: Amazon Textract, Google Document AI, Azure Form Recognizer (since renamed Azure AI Document Intelligence), or something built on LayoutLM or Donut. Benchmark a few. Pick the best one. Wire it to the document store. Ship.
Six months later, extraction accuracy is stuck at seventy percent, everyone is blaming the model, and the vendor is patiently explaining that yes, their tool does extract ninety-five percent of fields on their benchmark dataset, which is not your dataset. The project quietly dies. Another team at another company tries again next year with a different vendor and gets the same result.
The model was never the problem. The problem is that nobody scoped the work that sits upstream and downstream of the OCR engine, and all that unscoped work rolled downhill onto the extraction accuracy number.
What sinks IDP projects
Three things kill intelligent document processing projects, in roughly descending order of severity.
The first is source document quality. An OCR engine's accuracy is not a fixed number; it is a function of the resolution, angle, lighting, and cleanliness of the source documents. A scan from a shared office copier at the default 200 DPI setting loses roughly ten percent of achievable accuracy before any machine learning enters the picture. Documents that arrive as emailed phone photos of printed forms lose more. If there are staples, coffee stains, handwritten annotations, or fold lines across key fields, no amount of model fine-tuning recovers that signal. When a stalled IDP project gets rescued, the first productive week is usually spent not on the model at all, but on upgrading the scanning process. The fix is concrete: 300 DPI minimum, consistent lighting, pages deskewed so the text sits upright, staples removed. This feels unglamorous, and it is. It is also the intervention that moves the accuracy needle the most, for the least money.
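The resolution requirement can be checked mechanically before a page ever reaches the OCR engine. What follows is a minimal sketch, assuming the physical page width is known (8.5 inches for US Letter here) and that only the pixel width of the scan is available. The helper names `effective_dpi` and `needs_rescan` are hypothetical; the 300 DPI threshold is the minimum named above.

```python
# Sketch: estimate scan resolution from pixel width and flag low-resolution scans.
# Helper names and the default page width are illustrative assumptions.

MIN_DPI = 300  # minimum resolution recommended in the text


def effective_dpi(width_px: int, page_width_in: float = 8.5) -> float:
    """Return dots per inch given the pixel width and the physical page width."""
    if page_width_in <= 0:
        raise ValueError("page_width_in must be positive")
    return width_px / page_width_in


def needs_rescan(width_px: int, page_width_in: float = 8.5) -> bool:
    """True when the scan falls below the minimum resolution."""
    return effective_dpi(width_px, page_width_in) < MIN_DPI


if __name__ == "__main__":
    # 1700 px across a Letter page is 200 DPI; 2550 px is 300 DPI.
    for width in (1700, 2550):
        print(width, effective_dpi(width), needs_rescan(width))
```

Running a check like this over a sample of incoming documents gives a rough count of how many need rescanning. It says nothing about lighting, skew, or stains, which still need a visual spot check.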
The second killer is ambiguous field definitions. A team wants to extract 'invoice number' from a pile of supplier invoices. Seems straightforward. Until you look at ten supplier invoices and find: one calls it 'Invoice No.', another prints 'Document Reference', a third has both an internal ID and a customer-facing number and it's unclear which should go in the column. The finance team has been happily treating 'invoice number' as a fuzzy concept for a decade, because humans can tell from context. A model cannot. You can't train a model to extract something the humans sourcing the labels don't agree on, and you definitely can't train one to guess which of two values should go in the field. Every IDP project that gets any traction begins with a painful, unglamorous week of alignment meetings where finance, operations, and data engineering agree, in writing, what each field actually means.
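One way to make "agreed, in writing" concrete is to store each definition as data the pipeline can read. The sketch below is an assumption about how that might look, not a prescribed format: `FieldDefinition`, `normalize`, and `resolve_label` are hypothetical names, the two labels come from the example above, and the tie-break text is a placeholder for whatever rule the teams actually sign off on.

```python
# Sketch: a written field definition that maps supplier labels to one canonical field.
# Class, function, and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FieldDefinition:
    name: str                         # canonical column name
    accepted_labels: Tuple[str, ...]  # printed labels agreed to mean this field
    tie_break: str                    # written rule for documents with two candidates


def normalize(label: str) -> str:
    """Lowercase and keep only letters and digits, so 'Invoice No.' matches 'invoice no'."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def resolve_label(label: str, definitions: List[FieldDefinition]) -> Optional[str]:
    """Return the canonical field name for a printed label, or None if nobody agreed on it."""
    key = normalize(label)
    for definition in definitions:
        if key in {normalize(accepted) for accepted in definition.accepted_labels}:
            return definition.name
    return None


INVOICE_NUMBER = FieldDefinition(
    name="invoice_number",
    accepted_labels=("Invoice No.", "Document Reference"),
    tie_break="Placeholder: the rule finance, operations, and data engineering sign off on.",
)
```

A label that resolves to `None` is a signal to go back to the alignment meeting, not to let the model guess.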
The third killer is the missing downstream consumer. The extracted data gets written to… where? In what schema? With what validation? Most IDP projects are scoped as 'extract fields from documents' and stop there, producing a CSV or JSON payload that someone then has to integrate with the ERP, the CRM, the accounts payable system, or the data warehouse. The format required by those downstream systems is almost never exactly what the model produces, which means every extraction project secretly includes an undisclosed integration project that doubles the timeline. Worse, the downstream system usually has validation rules the extraction pipeline did not know about (date formats, required fields, foreign key constraints), and rejected records pile up silently. An IDP pipeline that ships extraction without testing the full roundtrip to the target system is a pipeline that works in the demo and fails in production.
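The silent pile-up of rejected records can be avoided by running the downstream rules inside the extraction pipeline and reporting every violation. The sketch below assumes a hypothetical target schema with three required fields and one ISO-formatted date; `TARGET_SCHEMA` and `validate_record` are illustrative names, and foreign key constraints are left out because they can only be checked against the real target system.

```python
# Sketch: validate an extracted record against downstream rules before loading it.
# The schema shape, field names, and date format are illustrative assumptions.
from datetime import datetime
from typing import Dict, List

TARGET_SCHEMA = {
    "required": ["invoice_number", "invoice_date", "supplier_id"],
    "date_fields": {"invoice_date": "%Y-%m-%d"},
}


def validate_record(record: Dict[str, str], schema: dict = TARGET_SCHEMA) -> List[str]:
    """Return human-readable errors; an empty list means the record passed these checks."""
    errors = []
    # Required fields must be present and non-empty.
    for field in schema["required"]:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    # Date fields that are present must parse in the format the target expects.
    for field, fmt in schema["date_fields"].items():
        value = record.get(field)
        if value:
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                errors.append(f"bad date format in {field}: {value!r} (expected {fmt})")
    return errors
```

Returning a list of errors instead of raising on the first one keeps rejected records visible: they can be counted, logged, and routed to a person.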
The practical implication: before any discussion of models, the first questions on any IDP engagement should be about scope, not approach. Specifically: have the source documents been sampled, and is the scan quality consistent? Are the field definitions agreed in writing by the business users? Is the target system's schema documented, and has at least one record flowed end to end into it before the first model training run?
If those three things are not in place, the project is already in trouble, and no tool on the market will fix that. If they are in place, the choice of OCR engine or layout-aware transformer is almost incidental; most of them will get you to ninety-five percent on cleanly scoped data.
The tools are a commodity. Knowing what 'extracted' actually means is the job.
# scope/preflight.py: refuse to model until every check below passes
# IDPScope is assumed to be defined elsewhere (for example as a dataclass)
# with the four attributes read here.
def validate(scope: "IDPScope") -> None:
    assert scope.dpi >= 300, "Rescan source documents at 300+ DPI first."
    # An empty mapping also fails: no definitions means nothing was agreed.
    assert scope.field_definitions and all(scope.field_definitions.values()), "Field definitions not agreed in writing."
    assert scope.target_schema, "Downstream consumer schema not documented."
    assert scope.roundtrip_tested, "End-to-end record flow not tested."

All four checks can fail without a single line of model code changing. The model isn't where IDP projects fail.