
RAG in production: lessons from a document-AI pipeline

Most RAG writeups stop at the demo. Here is what we learned shipping one through a year of real freight-forwarding operations — multilingual documents, multi-provider failover, calibrated confidence, and the human-in-the-loop pattern that actually moved the metrics.

14 min read · Updated April 8, 2026

Retrieval-Augmented Generation has the unusual distinction of being both genuinely transformative for production workloads and almost universally taught at a level of abstraction that does not survive contact with real data. Most RAG content stops at the diagram with three boxes — embed, retrieve, generate — and a working chatbot demo. Real systems live in the gaps between those boxes.

We just spent eleven months building one of those real systems: a document-AI pipeline that ingests 800+ multilingual shipping documents per day for a freight-forwarder client, classifies them, extracts structured fields, and either auto-processes them into the operational system or queues them for human review. The system runs at a 0.3% error rate, processes 95% of documents without a human touch, and turned a 12-person ops team into a 4-engineer-plus-LLM operation that does the same volume in less time.

This post is the field notes — the architectural decisions, the dead ends, the lessons that would have saved us months if we had known them at kickoff. It is opinionated. It is also written for engineering leadership who would actually build something like this, not for a conference talk.

01

The naive RAG architecture, and where it breaks

The textbook architecture has three components. You take your documents, embed them into a vector space, store them in a vector database. When a question comes in, you embed the question, retrieve the top-k similar chunks, paste them into a prompt with the question, and ask the LLM to answer. There is a lot to like about this — it is simple, it generalizes well, it solves the context-window problem.
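
For concreteness, here is that flow as a minimal sketch. The embed, vector_store, and llm names are illustrative stand-ins for whichever embedding model, vector database, and chat model you run, not any specific library's API.

python
def answer(question: str, embed, vector_store, llm, top_k: int = 5) -> str:
    # 1. Embed the question into the same vector space as the documents.
    query_vector = embed(question)

    # 2. Retrieve the top-k most similar chunks.
    chunks = vector_store.search(query_vector, limit=top_k)

    # 3. Paste the chunks into a prompt and ask the LLM to answer.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)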

It also breaks at production scale in seven specific places. We hit five of them in the first eight weeks. Roughly in the order they showed up:

  1. Retrieval quality degrades sharply when documents do not look like clean prose — and almost no production document looks like clean prose. Bills of lading are tables with stamped fields. Invoices have line items that contradict the body text. The naive embedding-as-text approach silently smears these structures into a meaningless soup.
  2. Multilingual content does not embed into the same neighborhoods. A Georgian bill of lading and an English bill of lading covering the same shipment will live in different parts of the vector space, even though they should be retrieved together for context.
  3. The LLM hallucinates structured output — confidently. Asked for an HS code, it will return a plausible six-digit string. Asked the same question twice with the same context, it will sometimes return different ones (a mitigation is sketched after this list).
  4. Latency is brutal. The naive flow has at least three sequential network hops (embed query, retrieve, generate). At p95, that is well over a second per document. At 800 documents per day with bursts, you cannot run this synchronously in an ops console without users hating you.
  5. Cost scales with volume in a way the demo does not warn you about. Embedding storage is cheap; generation is not. A poorly-tuned pipeline will rack up four-figure monthly bills processing low-confidence documents that should have routed to a cheaper path.
  6. Monitoring is wholly absent. The default frameworks ship without observability. You will discover that retrieval quality has degraded only when ops complaints arrive.
  7. When the model provider has a bad afternoon, your entire pipeline dies. Single-provider dependency is a production risk that gets cheerfully ignored in tutorials.
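
On point 3, schema validation with a bounded retry is the cheap, standard mitigation. A minimal sketch, assuming Pydantic v2 and a hypothetical llm.generate call that returns the model's raw text:

python
from pydantic import BaseModel, Field, ValidationError


class HSCodeExtraction(BaseModel):
    # Hypothetical target schema for a single extracted field.
    hs_code: str = Field(pattern=r"^\d{6}$")


def extract_hs_code(llm, prompt: str, max_attempts: int = 2) -> HSCodeExtraction | None:
    """Ask for JSON, validate against the schema, retry once on failure."""
    for _ in range(max_attempts):
        raw = llm.generate(prompt + "\nRespond with JSON only.")
        try:
            return HSCodeExtraction.model_validate_json(raw)
        except ValidationError:
            continue  # malformed or out-of-schema output; ask again
    return None  # caller routes this document to human review

Validation catches malformed or out-of-schema output; it does not catch a plausible but wrong code, which is why the confidence routing in section 03 still matters.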

02

Dual-provider routing as a baseline, not an upgrade

We started the engagement with a single LLM provider for both extraction and classification. Within three weeks we had two outage events of meaningful duration — both during the client's daytime ops window. Neither was the provider's fault in any morally useful sense, but the operational impact was real. After the second one we redesigned around dual-provider routing as a baseline, not as a future enhancement.

The implementation is conceptually trivial — wrap each LLM call in an interface that knows how to route to either provider, with a fallback chain. The interesting design choices are around when to fail over and how to monitor it.

python
import asyncio

from pydantic import BaseModel

# ProviderError, Result, log, and metrics are project-level types and
# singletons (structured logger, Prometheus-style counters); they are shown
# here as used, not defined.

class LLMRouter:
    def __init__(self, primary, fallback, latency_budget_ms=8000):
        self.primary = primary
        self.fallback = fallback
        self.latency_budget_ms = latency_budget_ms

    async def extract(self, prompt: str, schema: type[BaseModel]) -> Result:
        # Give the primary provider a bounded latency budget, then fall back.
        try:
            return await asyncio.wait_for(
                self.primary.extract(prompt, schema),
                timeout=self.latency_budget_ms / 1000,
            )
        except (asyncio.TimeoutError, ProviderError) as e:
            log.warning("primary failed; falling over", error=str(e))
            metrics.failover_total.labels(reason=type(e).__name__).inc()
            return await self.fallback.extract(prompt, schema)

The latency budget is the meaningful design parameter. Set it too tight and you fail over on transient slowness, doubling your cost for no quality gain. Set it too loose and you ride a hung request to its eventual timeout while the user waits. We landed on 8 seconds as the budget for our extraction calls — long enough to absorb normal variance, short enough that fallback is meaningfully faster than waiting.

The two providers we run (GPT-4o and Claude 3.5 Sonnet) have meaningfully different failure modes. They almost never degrade at the same time, which is the whole point. They also have meaningfully different cost profiles for our workload, which means pinning all traffic to one of them arbitrarily produces a noticeably more expensive bill than letting the router pick based on availability and recent latency.
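
The shape of that selection logic is simple enough to sketch: keep a small rolling window of latency and failure observations per provider and prefer the one with the better recent score. The window size and penalty weight below are illustrative choices, not the production values.

python
from collections import deque
from statistics import median


class ProviderHealth:
    """Rolling window of recent observations for one provider (illustrative)."""

    def __init__(self, window: int = 50):
        self.latencies_ms = deque(maxlen=window)
        self.outcomes = deque(maxlen=window)  # True = success, False = failure

    def record(self, latency_ms: float, ok: bool) -> None:
        self.latencies_ms.append(latency_ms)
        self.outcomes.append(ok)

    def score(self) -> float:
        # Lower is better: recent median latency, heavily penalised by the
        # recent failure rate.
        if not self.latencies_ms:
            return 0.0  # no data yet; let the provider take traffic
        failure_rate = 1 - (sum(self.outcomes) / len(self.outcomes))
        return median(self.latencies_ms) * (1 + 10 * failure_rate)


def pick_primary(providers: dict[str, ProviderHealth]) -> str:
    # Route the next request to the provider with the best recent score.
    return min(providers, key=lambda name: providers[name].score())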

03

Confidence-aware routing — the unfashionable workhorse

The biggest single architectural decision in the system was treating confidence as a first-class output of the pipeline, with explicit thresholds gating downstream behavior. Nothing about this is novel. Almost no production AI tutorial mentions it.

Every extraction emits a calibrated confidence score per field, derived from a combination of the OCR confidence, the LLM's own self-reported confidence (which is partly garbage but partly useful), and historical accuracy on similar fields. Documents above 0.95 confidence across all fields auto-process to the TMS. Documents between 0.85 and 0.95 route to one-click ops review. Below 0.85 goes to full manual review.
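
The routing itself is a few lines. The thresholds below are the ones described above; the blend is deliberately simplified to a fixed weighted average, and the 0.4/0.2/0.4 weights are purely illustrative — the real calibration is fit against historical accuracy rather than hard-coded.

python
from enum import Enum


class Route(Enum):
    AUTO_PROCESS = "auto_process"    # straight into the TMS
    ONE_CLICK_REVIEW = "one_click"   # ops confirms with a single keystroke
    MANUAL_REVIEW = "manual"         # full manual review


def field_confidence(ocr_conf: float, llm_conf: float, historical_acc: float) -> float:
    # Illustrative blend only; weights are an assumption, not the production fit.
    return 0.4 * ocr_conf + 0.2 * llm_conf + 0.4 * historical_acc


def route_document(field_confidences: dict[str, float]) -> Route:
    # Routing is gated on the weakest field, not the average.
    worst = min(field_confidences.values())
    if worst >= 0.95:
        return Route.AUTO_PROCESS
    if worst >= 0.85:
        return Route.ONE_CLICK_REVIEW
    return Route.MANUAL_REVIEW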

We launched with conservative thresholds (most documents fell into mid-confidence and got reviewed) and let the auto-process rate climb naturally as the retrieval layer improved with real-world feedback. Day one: 60% auto-processed. Six months later: 95%. The thresholds did not change much during that period — what improved was the underlying retrieval quality, which made the calibrated scores more reliable.

04

Human-in-the-loop is the killer feature, not the failure mode

There is a strain of AI thinking — popular in some VC decks, almost never in production engineering — that treats human-in-the-loop as a transitional embarrassment, something to be replaced by full automation as soon as the model gets good enough. This is exactly backwards.

The human-in-the-loop console in our pipeline is not a fallback. It is the mechanism by which the system improves over time. Every correction an ops user makes flows back into the retrieval layer's quality dataset. The same document type with the same correction pattern will tend not to need correction the next month. The system gets measurably better from real-world feedback, not from synthetic training data.
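
What flows back is small and boring, which is part of why it works. Something like the record below, where the field names are illustrative rather than the production schema:

python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CorrectionEvent:
    """One ops correction, as it flows back into the quality dataset."""

    document_id: str
    document_type: str        # e.g. "bill_of_lading", "commercial_invoice"
    field_name: str           # which extracted field was corrected
    extracted_value: str      # what the pipeline produced
    corrected_value: str      # what the ops user changed it to
    reason_code: str          # categorical feature for the quality dataset
    model_confidence: float   # the calibrated score at extraction time
    corrected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))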

Building the console deliberately well is high-leverage. Ops users will spot patterns in the AI's failures that the engineering team will not, because they have years of context the engineering team does not. The corrections they make are signal — but only if the console makes correcting things fast enough that they actually do it instead of working around the system.

  • Single-keystroke approval for high-confidence extractions (we use Enter — the muscle memory matters).
  • Field-level diffs that highlight exactly what changed when an extraction is corrected, so the ops user does not have to re-read the whole document.
  • Reason codes attached to corrections, which become a categorical feature in the retrieval quality dataset.
  • An undo that actually works — corrections are non-destructive and can be reverted within a session.
  • A shared queue with explicit work-in-progress markers so two ops users do not double-handle the same document.

The console took roughly 25% of the engineering effort on the project. It produced disproportionate returns and we would build it again — sooner, even, if we ran the project over.

05

Monitoring LLM workloads is its own discipline

By the time we hit production, the system was emitting roughly 40 distinct metrics from the LLM and retrieval layer alone. Most of them were obvious in retrospect; almost none were obvious upfront. The categories that mattered most:

Per-provider request metrics

  • Latency p50/p95/p99, broken out by request type and document length.
  • Failure rate by error class — distinguish 429s (over quota) from 503s (provider down) from timeouts (your network) because the response is different for each.
  • Tokens consumed, broken out by input vs. output and by model.
  • Cost per request, computed in real time from the token counts and current provider pricing (a sketch of that accounting follows this list).
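
A sketch of the cost-per-request accounting, assuming prometheus_client-style counters (the metrics.failover_total call in the router code follows the same pattern). The metric names are illustrative, and pricing comes from configuration so it can track the providers' current price sheets rather than being hard-coded.

python
from prometheus_client import Counter

TOKENS_TOTAL = Counter(
    "llm_tokens_total", "Tokens consumed", ["provider", "model", "direction"]
)
COST_USD_TOTAL = Counter(
    "llm_cost_usd_total", "Estimated spend in USD", ["provider", "model"]
)


def record_usage(provider: str, model: str, input_tokens: int,
                 output_tokens: int, pricing: dict[str, float]) -> None:
    """Record token usage and estimated cost for one request.

    `pricing` maps "input"/"output" to USD per token for this model,
    loaded from configuration.
    """
    TOKENS_TOTAL.labels(provider, model, "input").inc(input_tokens)
    TOKENS_TOTAL.labels(provider, model, "output").inc(output_tokens)
    cost = input_tokens * pricing["input"] + output_tokens * pricing["output"]
    COST_USD_TOTAL.labels(provider, model).inc(cost)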

Quality metrics

  • Auto-process rate (% of documents processed without human review). This is your headline metric.
  • Correction rate by field. Fields that get corrected often are the fields the retrieval layer is failing on.
  • Confidence calibration drift. If the calibrated confidence score is no longer matching observed accuracy, the calibration model needs refitting (a minimal drift check is sketched after this list).
  • Disagreement rate between providers. When the two LLM providers disagree on the same document with the same prompt, that is a strong signal that the document is hard.
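
The drift check does not need to be sophisticated. A minimal sketch, assuming you log a (calibrated confidence, was the extraction correct) pair for every reviewed field, computed as a rough expected-calibration-error over recent extractions:

python
from statistics import mean


def calibration_drift(records: list[tuple[float, bool]], bins: int = 10) -> float:
    """Mean gap between predicted confidence and observed accuracy, by bin.

    Each record is (calibrated_confidence, was_correct). Alert when the
    returned value creeps above the tolerance you originally calibrated to.
    """
    gaps, total = [], len(records)
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [
            (c, ok) for c, ok in records
            if lo <= c < hi or (b == bins - 1 and c == 1.0)
        ]
        if not bucket:
            continue
        predicted = mean(c for c, _ in bucket)
        observed = mean(1.0 if ok else 0.0 for _, ok in bucket)
        gaps.append(abs(predicted - observed) * len(bucket) / total)
    return sum(gaps)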

06

When not to use RAG

We get asked to add RAG to systems where it does not belong roughly as often as we get asked to add it to systems where it does. Three patterns where you should hesitate:

  1. When the structured output you want already lives in a structured database. A SQL query against a normalized table is faster, cheaper, and more accurate than asking an LLM to reason over a vectorized version of the same data. RAG is for unstructured or semi-structured content. If your data is already structured, use it.
  2. When the document corpus is small and stable. If you have 200 documents and they almost never change, just paste them into the prompt context and skip the retrieval layer entirely. The whole apparatus of embeddings, vector stores, and retrieval is overhead until your corpus exceeds the context window or changes frequently enough to warrant the indexing cost.
  3. When the failure mode of being wrong is catastrophic and you do not have a human-in-the-loop budget. RAG systems are statistically going to be wrong sometimes. If your domain cannot tolerate that — for example, dispensing medication doses — then either commit fully to human-in-the-loop or do not use the technology.

07

Closing

The biggest insight from a year of running this in production is structural, not technical. The hard part of production AI is not the models — they are mostly good enough. It is everything around them: the retrieval quality work, the confidence routing, the human-in-the-loop console, the observability, the dual-provider resilience, the quality dataset that improves the system over time. The model is roughly 15% of the engineering effort. The remaining 85% is what determines whether the system actually works in the operational reality of your business.

If you are building one of these systems and want to compare notes, we would genuinely enjoy that conversation. The space is moving fast enough that field reports from real production deployments are still rare.

Want to compare notes?

Most engagements start with a 30-minute discovery call. No pitch deck, no NDAs on day one — just an honest conversation about what you are building.

Schedule a Call