Unverified: What Practitioners Post About OCR, Agents, and Tables

A deep dive into the unverified but consistent reports from engineers and practitioners on the real-world challenges of OCR and AI document processing, highlighting the gap between vendor promises and production reality.

About This Report

I spent a month reading engineering forums and practitioner discussion boards instead of vendor press releases. Anonymous posts, unverified credentials, no editorial review. Someone claims to have processed 150,000 handwritten pages. Someone else claims their agent failed silently on day 11. A developer says they replaced $100 per month in API costs with a €2,000 eBay purchase. None of this is verified.

What I can verify is that the same patterns showed up independently across all 22 capability areas on this site. The same complaints, the same workarounds, the same numbers within the same ranges, posted by people who do not appear to know each other. That consistency is either a coincidence or a signal. I am treating it as a signal, with the caveat that forum posts are forum posts.

The Demo Works. Production Does Not.

Someone describing themselves as an operations coordinator writes about testing eight OCR tools on 200+ multilingual shipping invoices. Most destroyed table formatting. Perfectly organized invoices turned into alphabet soup. Adobe Acrobat, Google Docs upload, and free online OCR tools all failed to preserve structure. ABBYY delivered better accuracy but felt dated. Weeks spent finding something that worked.

A poster claiming to process 10,000 NASA technical documents (scanned typewriter reports, handwritten notes, and propulsion diagrams from the 1950s onward) describes rebuilding their entire pipeline from scratch using vision-language models. Off-the-shelf parsers broke down on the first batch.

An RPA developer describes spending weeks building regex-based document parsing for loan applications, then rebuilding the entire workflow in two hours using n8n plus a language model.

From our February vendor coverage: Box Extract reported contract processing reduced from 20 minutes to under 2 minutes. UiPath's healthcare launch claimed medical record review dropped from 70 minutes to 6 minutes. If those numbers hold on vendor-selected use cases, they are impressive. The question is whether they hold on yours.

The OCR Fragmentation

Six months ago, a practitioner could name a preferred OCR engine with confidence. Based on what I read, that confidence is gone.

One widely discussed benchmark tested seven solutions on an academic document with footnotes, tables, figures, and equations. Mistral's API ranked first. Marker with Gemini second. Docling third. Tesseract did not place. The discussion that followed was more revealing than the rankings. Nearly every practitioner preferred a different stack. PaddleOCR. MinerU. Qwen2.5-VL with Marker. PyMuPDF4LLM. Each reportedly worked on someone's documents and failed on someone else's.

Posters describing handwriting-heavy workloads report that legacy OCR achieves zero useful accuracy on cursive. Cloud OCR from Azure, Google, and AWS reportedly manages 45-50% accuracy on handwriting. For these posters, the shift to vision-language models is not optional.

85% on Page One. 65% by Page Three.

Someone describing a 12-month production deployment (not a benchmark, but an operation processing 150,000+ handwritten pages) posted accuracy numbers that vendor pitch decks do not contain.

GPT-4.1 reportedly achieved roughly 85% accuracy on clean single-page handwriting. By page three it dropped to 65%. The poster describes the model fabricating data for later pages rather than flagging uncertainty. One inspector's name from page one appeared on page three's extraction where a different inspector had signed.

Claude Sonnet 4 was described as the most consistent at approximately 83% across all pages. But it returned editorial prose when raw field extraction was needed. Ask for structured JSON, get a summary of the document instead.

Gemini reportedly achieved around 84% on clean sections and 70% on messier content. Structured output sometimes came back as valid JSON and sometimes came back garbled. Multiple posters independently described being surprised by Gemini's OCR quality, particularly on handwritten documents where other frontier models extracted nothing useful.
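The structured-output failures described in this section (prose where JSON was requested, garbled JSON, fields that come back empty) are the kind of thing a pipeline can catch before the data goes anywhere. What follows is a minimal sketch of such a gate, not anything a poster shared. The field names and the `parse_extraction` helper are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical field names for illustration; not taken from any poster's schema.
REQUIRED_FIELDS = ("inspector_name", "inspection_date")


def parse_extraction(raw_response, required_fields=REQUIRED_FIELDS):
    """Return (record, problems); record is None when the response is unusable."""
    try:
        record = json.loads(raw_response)
    except json.JSONDecodeError:
        # Prose summaries and garbled output both end up here.
        return None, ["response is not valid JSON"]
    if not isinstance(record, dict):
        return None, ["response is JSON but not an object"]
    # Flag fields that are absent, null, or empty strings.
    problems = [
        f"missing or empty field: {name}"
        for name in required_fields
        if record.get(name) in (None, "")
    ]
    return record, problems
```

A check like this only confirms the shape of the response. It would not catch the fabrication case above, where a plausible but wrong inspector name fills the field; that needs a comparison against the page content itself.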

The €2,000 Stack

One poster documented replacing roughly $100 per month in cloud API costs with a Mac Studio M1 Ultra purchased on eBay for under €2,000. Three AI agents coordinating through Telegram. Qwen 3.5 running at 60 tokens per second. Vision processing, speech, document extraction, all local. Zero cloud dependencies. If the numbers are accurate, the hardware pays for itself in under two years.
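The payback claim is simple arithmetic, but it mixes currencies and leaves out running costs. A small sketch makes the assumptions explicit. The exchange rate and the electricity figure are placeholders I am supplying, not numbers from the post, and `payback_months` is a hypothetical helper.

```python
def payback_months(hardware_cost_eur, monthly_savings_usd,
                   usd_per_eur, monthly_power_cost_usd=0.0):
    """Months until avoided API spend covers the hardware purchase."""
    hardware_cost_usd = hardware_cost_eur * usd_per_eur
    net_monthly_savings = monthly_savings_usd - monthly_power_cost_usd
    if net_monthly_savings <= 0:
        raise ValueError("running costs exceed savings; hardware never pays back")
    return hardware_cost_usd / net_monthly_savings


# Assumption: euro and dollar at parity, electricity ignored.
print(payback_months(2000, 100, usd_per_eur=1.0))  # 20.0
```

At parity and with no power cost, the result is 20 months, which fits "under two years". A stronger euro or a meaningful electricity bill pushes the figure toward or past the 24-month mark, so the claim depends on inputs the post did not state.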

This matches a broader pattern in what I read. Developers who need document processing describe building their own pipelines rather than evaluating IDP vendors. Docling, Marker, PaddleOCR, Kreuzberg, MinerU. The open-source tools appear to be good enough for many use cases.

The Hybrid Pipeline Won

The technical consensus, if forum posts reflect consensus, points to a two-stage architecture. A dedicated OCR or layout model converts documents to structured markdown, then a language model handles extraction and reasoning.
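The shape of that two-stage architecture can be sketched in a few lines. This is an illustration under stated assumptions, not any poster's implementation: `run_hybrid_pipeline`, `PipelineResult`, and both stand-in stages are hypothetical names, and in practice the first stage would wrap an OCR or layout tool and the second would call a language model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    markdown: str  # intermediate output of the OCR/layout stage
    fields: dict   # structured output of the extraction stage


def run_hybrid_pipeline(document_bytes: bytes,
                        layout_stage: Callable[[bytes], str],
                        extraction_stage: Callable[[str], dict]) -> PipelineResult:
    """Stage 1 turns the document into markdown; stage 2 extracts fields from it."""
    markdown = layout_stage(document_bytes)
    fields = extraction_stage(markdown)
    return PipelineResult(markdown=markdown, fields=fields)


# Stand-in stages so the sketch runs without any external tool or model.
def fake_layout_stage(document_bytes: bytes) -> str:
    return "| Item | Qty |\n| --- | --- |\n| Bolt | 4 |"


def fake_extraction_stage(markdown: str) -> dict:
    # Skip the header and separator rows, then read "item | quantity" pairs.
    rows = [line.strip("|").split("|") for line in markdown.splitlines()[2:]]
    return {cells[0].strip(): int(cells[1]) for cells in rows}


result = run_hybrid_pipeline(b"%PDF-", fake_layout_stage, fake_extraction_stage)
print(result.fields)  # {'Bolt': 4}
```

Keeping the markdown as an explicit intermediate means each stage can be swapped or inspected on its own, which matters when the OCR layer is the part practitioners keep replacing.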

Posters describing large-scale deployments (one claims 2.6 million pages in a burst ingestion) report that the hybrid approach beats sending raw images directly to a vision model in both accuracy and cost. End-to-end LLM processing reportedly costs an order of magnitude more than reserving the LLM for extraction logic.

Source: Hacker News