Trunk Tools' stack cut document review from 60 days to 10 by ditching general-purpose models

Trunk Tools built a specialized three-layer AI architecture to handle complex construction data, cutting document review times from months to days. By moving away from general-purpose LLMs, the company successfully automated highly-specific workflows with high accuracy.
Most verticals aren’t clean, well-oiled SaaS databases; the reality is ugly documents, proprietary schemas, implicit workflows, and long‑running tasks that most general-purpose models struggle with.
This prompted construction project management company Trunk Tools to build a specialized, three-layer architecture — perception, semantics, agents — based on highly-detailed data to support high-accuracy, highly-relevant industry automation.
Their purpose-built stack has shrunk review cycles from months to days, prevented costly field errors, and given autonomous agents the ability to reason over millions of pages of documentation, Trunk says.
“We really set out to take the data from dispersed systems, pre-process it, structure it, go through our ontology into a knowledge graph, and then train AI models,” said Sarah Buchner, Trunk’s founder and CEO and a former carpenter.
For builders in other verticals, Trunk’s approach could serve as a blueprint for transforming data chaos into agent‑ready, industry-specific workflows.
Where general-purpose LLMs break down on industry data
Foundation LLMs, while powerful, are optimized for breadth, not always depth.
“General-purpose LLMs are trained to be okay at everything, so they're weak at anything niche,” said Kriti Faujdar, a senior product manager working in AI infrastructure, agentic AI, security, and LLM platforms. Weak spots include rare terms, domain-specific reasoning, and the unspoken context that any practitioner “just knows.”
Web, app, and software developer Sébastien De Bollivier agreed that the biggest bottleneck is reliability on data that is “jargon-dense, abbreviation-heavy, and format-specific.”
“A GPT-4-class model can understand a French legal contract, but will fumble the specific article references practitioners need to cite,” he said.
Besides, the most valuable enterprise data never made it into pretraining anyway, Faujdar pointed out. It's sitting in internal systems and proprietary formats. “RAG helps a little,” she said. “But it's just giving better facts to a model that still can't reason properly in the domain.”
Pre-training on domain data is critical; enterprises should then fine-tune on good task examples and build their own evals. “A few thousand examples from real practitioners beats millions of scraped, noisy ones,” Faujdar said.
Mixture-of-experts (MoE) can provide specialization without inference costs blowing up. Pairing RAG with fine-tuning also works well; RAG handles the factual long tail while fine-tuning fixes vocabulary and reasoning.
De Bollivier pointed to the advantage of hybrid stacks: A general-purpose model for reasoning and orchestration, a smaller fine-tuned model (or dense retrieval over a curated corpus) for domain-specific extraction. He advised: “Don't fine-tune to make the model 'smarter' about a domain, fine-tune to make it more reliable on the specific output format your workflow requires.”
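The hybrid stack De Bollivier describes can be pictured as a simple router. The following is a minimal sketch, not anyone's production design: the function `make_router`, the task type `"extract"`, and the two stand-in handlers are all hypothetical names chosen for illustration, and real models would replace the placeholder functions.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_router(general: Handler, specialists: Dict[str, Handler]) -> Callable[[str, str], str]:
    """Build a router that sends known task types to a specialist handler.

    Anything without a registered specialist falls back to the
    general-purpose handler (hypothetical stand-ins for real models).
    """
    def route(task_type: str, payload: str) -> str:
        handler = specialists.get(task_type, general)
        return handler(payload)

    return route


def general_model(payload: str) -> str:
    # Placeholder for a general-purpose model used for reasoning and orchestration.
    return f"general:{payload}"


def extraction_model(payload: str) -> str:
    # Placeholder for a smaller fine-tuned model that returns a fixed output format.
    return f"extractor:{payload}"


route = make_router(general_model, {"extract": extraction_model})
```

Here `route("extract", ...)` reaches the specialist, while any other task type goes to the general model.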
The trades and construction are certainly industries seeing traction with these techniques, as are legal and healthcare, De Bollivier said. These verticals have “high stakes for errors plus standardized document formats, equaling clear domain-training ROI.”
One honest caveat, Faujdar said: Specialized models often fall apart outside their domain, so they are of little use beyond their area of expertise unless they are retrained.
Perception, semantics, agents: inside Trunk's three-layer stack
In highly-specialized domains like construction, “data dumps” into large language models (LLMs) don’t cut it, said Trunk’s CTO Amrish Kapoor. This is because most transformers are probabilistic models: When given an image, they report back that it is “probably” a tree, or “probably” a child playing next to a tree.
This makes them insufficient for high‑precision symbolic interpretation. For instance, in construction documents, a 2-millimeter-wide symbol has a vastly different meaning depending on where it’s placed.
Further, constrained by context limits, probabilistic models struggle with long‑term project memory. “I don't mean a context window of a few tokens,” Kapoor said. “I'm talking about long term memory that stretches across months and years, because this is how long some of these projects are.”
Instead, Trunk’s three-layer system breaks workflows into:
Perception (reading and extracting data from messy documents like PDFs, drawings, or scans).
A semantic/graph layer (making sense of that data and understanding its relationships).
LLMs and agents on top.
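To make the three layers concrete, here is a toy sketch of how they might hand off to one another. It is an illustration under heavy assumptions, not Trunk's implementation: the `kind: name` input format, the identifiers `D-101` and `SPEC-08`, and every class and function name below are hypothetical, and the real perception layer works on drawings and scans rather than tidy text lines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "door", "spec", "trade"
    name: str


@dataclass
class KnowledgeGraph:
    edges: Dict[Entity, List[Tuple[str, Entity]]] = field(default_factory=dict)

    def link(self, source: Entity, relation: str, target: Entity) -> None:
        self.edges.setdefault(source, []).append((relation, target))

    def related(self, source: Entity) -> List[Tuple[str, Entity]]:
        return self.edges.get(source, [])


def perceive(raw_lines: List[str]) -> List[Entity]:
    """Perception stand-in: turn 'kind: name' lines into entities, skipping noise."""
    entities = []
    for line in raw_lines:
        kind, _, name = line.partition(":")
        if name.strip():
            entities.append(Entity(kind.strip(), name.strip()))
    return entities


def build_graph(entities: List[Entity], relations: List[Tuple[str, str, str]]) -> KnowledgeGraph:
    """Semantic stand-in: connect entities that were actually extracted."""
    by_name = {entity.name: entity for entity in entities}
    graph = KnowledgeGraph()
    for source, relation, target in relations:
        if source in by_name and target in by_name:
            graph.link(by_name[source], relation, by_name[target])
    return graph


def answer(graph: KnowledgeGraph, entity: Entity) -> List[str]:
    """Agent stand-in: describe everything the graph knows about one entity."""
    return [f"{entity.name} {relation} {target.name}" for relation, target in graph.related(entity)]
```

The point of the split is that the agent never touches raw documents; it only reasons over relationships the lower layers have already established.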
Construction drawings are typically symbolic, Buchner said. A door isn't always labeled ‘door.’ Sometimes it's simply an arc on a wall that a trained eye learns to read based on years of practice.
“The perception layer is what teaches AI to read that language,” she said. The semantic layer then gives that information meaning; for instance, connecting the door to the drawing that details it, the spec that governs it, and the trade that installs it. This helps answer project engineers’ critical questions: Not "is there a door here?" but "does this door create a problem down the line?"
Particularly in construction, that shift matters because the cost of a problem compounds with time. “A conflict caught in design is relatively low cost to address,” Buchner said, “whereas the same problem caught in the field might cost tens of thousands of dollars.”
At a high level, the system identifies the document type and begins extracting information based on the content (drawings, schedules, paragraph text). This data is then “transformed and augmented” in the platform, which triggers agentic workflows such as building knowledge graph relationships, along with end-user workflows.
For instance, an agent might review an architecture bulletin and produce a visual overlay comparing an older version and a newer version (flagging additions and removals), then generate written narratives that describe what those changes are in simple terms. This helps users understand what’s changed and coordinate with trade partners on updated pricing and change orders.
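The version comparison described above boils down to flagging additions and removals, then describing them in plain language. A minimal sketch follows, assuming each version has already been reduced to a list of item labels; the function names and the sample labels are hypothetical, and a real agent would work from drawings rather than strings.

```python
from typing import Dict, Iterable, List


def compare_versions(old_items: Iterable[str], new_items: Iterable[str]) -> Dict[str, List[str]]:
    """Flag items that appear only in the newer or only in the older version."""
    old, new = set(old_items), set(new_items)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def narrate(changes: Dict[str, List[str]]) -> str:
    """Turn the flagged changes into a short plain-language summary."""
    parts = []
    if changes["added"]:
        parts.append("Added: " + ", ".join(changes["added"]) + ".")
    if changes["removed"]:
        parts.append("Removed: " + ", ".join(changes["removed"]) + ".")
    return " ".join(parts) or "No changes."
```

Keeping the comparison separate from the narrative means the same change list can feed both a visual overlay and a written summary.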
The scale of construction’s data problem
Construction workflows are “ripe with implicit assumptions and connections between data in its myriad of sources,” Buchner said. And the amount of unstructured data is “humanly impossible” to process or make sense of.
Buchner estimated the average high-rise building generates about 3.6 million pages of corresponding documentation. “If you print it into a stack of papers it would be as high as the building itself.”
All three layers of Trunk’s stack — perception, semantic, LLM — are trained on “very specific datasets” from customers with “explicit permissions” and auto‑labeling/IP, Kapoor explained. Customers who don’t want Trunk training on their data can opt out.
Data is deidentified and aggregated, and Trunk also collects “tons more” labeled data through other pipelines like 3D building information modeling (BIM).
Trunk says it only ships agents that achieve around 95% accuracy. The team maintains continuous evaluation pipelines based on ground truth data from customers and experts. They also employ an LLM-as-a-judge model.
“This notion of an LLM as a judge is to score how well you're doing, both subjectively as well as objectively,” Kapoor said. Objectivity can be an easy ‘right’ or ‘not right,’ but subjectivity requires more nuance.
For instance, when creating an email, narrative, or explanation, an LLM-as-a-judge framework can produce a composite score: a numerical value that aggregates different metrics to assess a model's performance or risk.
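One common way to build such a composite score is a weighted average of per-metric scores. The sketch below assumes that approach; the article does not say how Trunk aggregates its metrics, and the metric names and weights used here are hypothetical.

```python
from typing import Dict


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregate per-metric judge scores into one weighted average.

    Every metric in `scores` must have a weight; weights need not sum to 1
    because the result is normalized by their total.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the scored metrics must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical judge output for one generated narrative, each metric on a 0-1 scale.
example = composite_score(
    {"accuracy": 1.0, "tone": 0.5, "clarity": 0.75},
    {"accuracy": 0.5, "tone": 0.25, "clarity": 0.25},
)
```

Weighting lets an objective metric such as accuracy count for more than subjective ones such as tone, which matches the distinction Kapoor draws between the two.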
There can be challenges, though, particularly with latency, Buchner noted; any time the reasoning capacity of underlying models increases, the risk of latency goes up, too. Trunk maintains a set of evaluation criteria to objectively measure latency whenever changes are made to underlying infrastructure, agents, and API calls.
Then, “before we release to customers, we ensure marginal changes are thoroughly tested and optimized.”
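A latency check of the kind described could be as simple as comparing a high percentile before and after a change. This is a sketch under stated assumptions: the 95th percentile, the 10% tolerance, and the function names are illustrative choices, not figures or methods reported by Trunk.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of latency samples (pct is an integer from 1 to 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_regressed(
    baseline: Sequence[float],
    candidate: Sequence[float],
    pct: int = 95,
    tolerance: float = 0.10,
) -> bool:
    """Return True if the candidate's percentile latency exceeds baseline by more than tolerance."""
    return percentile(candidate, pct) > percentile(baseline, pct) * (1 + tolerance)
```

Running the same check after every change to infrastructure, agents, or API calls gives an objective signal that a more capable model has not quietly made responses slower.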
Source: VentureBeat











