INNOVATE
03 · Logistics · Caucasus + EU corridor

Replacing 60% of a manual ops team with a document-AI pipeline (and redeploying the rest)

A freight forwarder was drowning in 800+ shipping documents a day across English, Georgian, Russian, and Turkish. Seven months later: 95% automated, 0.3% error rate, ops team redeployed to exception handling and customer success.

70%

less manual processing time

Across BoLs, customs forms, multilingual invoices

Client

Mid-sized freight forwarding & customs brokerage

Timeline

7 months to production + ongoing monthly retainer

Team

4 engineers (2 ML, 2 full-stack), 1 designer

Engagement

Discovery → POC → phased production rollout → monthly improvement retainer

01 — Challenge

The situation we walked into.

A mid-sized freight forwarding and customs brokerage had a 12-person ops team whose entire day was data entry — pulling fields off bills of lading, customs declarations, commercial invoices, and certificates of origin, then keying them into the routing system. Volume had doubled in two years; the team was at breaking point. Errors were costing real money — incorrect HS codes alone were generating customs holds, fines, and merchant chargebacks.

  • 800+ inbound documents per day across BoLs, customs declarations, commercial invoices, packing lists, and certificates of origin.
  • Multilingual sources — English, Georgian, Russian, and Turkish — with mixed handwritten and printed content.
  • 2-day average backlog during peak season; a 4% data-entry error rate generating roughly $40K/month in customs holds and chargebacks.
  • Hiring more ops staff was no longer working — training took 3 months and retention was poor for a job nobody wanted.
  • Existing TMS (transportation management system) had a workable API but no plan for automated ingestion.

02 — Approach

What we actually did, in order.

We treated this as a workflow-automation problem with AI inside it, not as an 'AI project.' The hardest decisions were about where humans stay in the loop, not which model to use.

01

Process discovery on the ops floor

Spent two weeks with the ops team — sitting next to them, watching what they actually did, not what the org chart said they did. Found seven document types and 23 distinct field-extraction patterns. About 40% of their time was spent on judgement calls (HS code disambiguation, route exceptions); 60% was pure transcription.

02

Pipeline architecture

Vision OCR (AWS Textract) for raw document scanning, LLM extraction (GPT-4o + Claude as fallback) for structured field parsing, RAG over the client's historical filings for context-aware classification (especially HS code disambiguation), and a review queue for anything the pipeline scored below confidence threshold.
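
In code terms, the flow looks roughly like this. This is a minimal Python sketch with stubs standing in for Textract, the LLM extractors, and the pgvector retrieval; run_ocr, extract_fields, and the other names are illustrative, not the production code:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    fields: dict[str, str]         # field name -> extracted value
    confidences: dict[str, float]  # field name -> calibrated confidence (0..1)

# Stubs standing in for the real stages (AWS Textract, GPT-4o/Claude, pgvector RAG).
def run_ocr(doc_bytes: bytes) -> str:
    return "raw OCR text with layout hints"

def extract_fields(ocr_text: str, doc_type: str) -> Extraction:
    return Extraction({"hs_code": "080810"}, {"hs_code": 0.97})

def retrieve_similar_filings(fields: dict[str, str], k: int = 5) -> list[dict]:
    return []  # pgvector similarity search over historical filings in production

def refine_with_context(extraction: Extraction, context: list[dict]) -> Extraction:
    return extraction  # in production: re-prompt with retrieved filings as context

def process_document(doc_bytes: bytes, doc_type: str) -> Extraction:
    ocr_text = run_ocr(doc_bytes)                          # 1. vision OCR
    extraction = extract_fields(ocr_text, doc_type)        # 2. LLM field parsing
    context = retrieve_similar_filings(extraction.fields)  # 3. RAG context
    extraction = refine_with_context(extraction, context)  # e.g. HS code disambiguation
    # 4. hand off to confidence-aware routing (step 04 below)
    return extraction
```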

03

Human-in-the-loop review console

Built a Next.js review console where the ops team could approve, correct, or reject the pipeline's output in seconds. Every correction fed back into a quality dataset that improved retrieval and prompt context — the system got measurably better month over month from real-world feedback, not synthetic training data.
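
A sketch of the feedback endpoint behind the console (FastAPI, matching the stack below); the Correction schema is hypothetical, but the principle is the one described above: every correction gets stored once and used twice, as evaluation data and as retrieval context:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Correction(BaseModel):
    # Hypothetical shape, illustrative of the feedback loop rather than the client's schema.
    document_id: str
    field: str            # e.g. "hs_code"
    model_value: str      # what the pipeline extracted
    corrected_value: str  # what the ops reviewer entered
    reviewer_id: str

def save_to_quality_dataset(correction: Correction) -> None:
    # In production this persists to Postgres and queues a re-embedding job
    # (Celery) so the corrected filing becomes future retrieval context.
    pass

@app.post("/corrections")
async def record_correction(correction: Correction) -> dict:
    save_to_quality_dataset(correction)
    return {"status": "recorded"}
```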

04

Confidence-aware routing

High-confidence extractions (>0.95 across all fields) routed automatically to the TMS. Mid-confidence (0.85–0.95) flagged for one-click ops review. Low-confidence routed to full manual review. The thresholds are tunable per document type and have been adjusted twice since launch.
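
As a sketch, with illustrative threshold values rather than the client's tuned ones:

```python
# Illustrative thresholds; production values are tuned per document type.
THRESHOLDS = {
    "bill_of_lading": (0.95, 0.85),       # (auto, review)
    "customs_declaration": (0.95, 0.85),
}

def route(confidences: dict[str, float], doc_type: str) -> str:
    auto_t, review_t = THRESHOLDS.get(doc_type, (0.95, 0.85))
    worst = min(confidences.values())  # a document is only as strong as its weakest field
    if worst >= auto_t:
        return "auto_to_tms"        # straight into the TMS via its API
    if worst >= review_t:
        return "one_click_review"   # pre-filled, ops confirms in seconds
    return "manual_review"          # full manual entry path

assert route({"hs_code": 0.97, "consignee": 0.99}, "bill_of_lading") == "auto_to_tms"
```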

05

Phased rollout by document type

BoLs first (highest volume, most standardized). Customs declarations second. Multilingual invoices last. Each phase ran in parallel with manual processing for two weeks before cutover, with daily accuracy comparisons.
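
The daily comparison was deliberately simple: the same documents ran through both paths, and we diffed the outputs field by field, treating manual entry as the baseline. A minimal sketch with illustrative names:

```python
def daily_accuracy(pipeline: dict[str, dict], manual: dict[str, dict]) -> float:
    """Field-level agreement between pipeline output and manual entry."""
    matches = total = 0
    for doc_id, fields in manual.items():
        for field_name, manual_value in fields.items():
            total += 1
            matches += pipeline.get(doc_id, {}).get(field_name) == manual_value
    return matches / total if total else 0.0

# e.g. one day's BoL run during phase one
pipeline_out = {"doc-1": {"hs_code": "080810", "consignee": "ACME"}}
manual_out   = {"doc-1": {"hs_code": "080810", "consignee": "ACME Ltd"}}
print(daily_accuracy(pipeline_out, manual_out))  # 0.5
```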

06

Ops team redeployment

We worked with leadership on the team transition early — three people moved into customer success, four into exception handling, three into a new compliance review function, two left for unrelated reasons. No layoffs were necessary, which was an explicit goal of the engagement.

03 — Stack

What it was built on.

Full technology stack
GPT-4o · Anthropic Claude 3.5 Sonnet · AWS Textract · Python (FastAPI) · PostgreSQL with pgvector · Redis · Celery · Next.js (review console) · TypeScript · Docker · AWS ECS · OpenTelemetry

04 — Results

The numbers we will stand behind.

70%

reduction in total manual processing time

0.3%

data-entry error rate

down from 4%

95%

of documents processed without human touch

8h

peak-season backlog

down from 2 days

80%

fewer customs hold incidents post-launch

05 — Outcome

What changed for the business.

The financial picture flipped within three months. The ~$40K/month in customs holds and chargebacks dropped to under $8K. The freed ops capacity moved into customer success — a function the company genuinely needed but had never been able to staff. Customer satisfaction scores moved up measurably, and the company won two RFPs in the following year that explicitly cited its document-handling speed as a deciding factor.

The pipeline continues to improve. Every ops correction feeds the retrieval layer and the prompt context library. We run a quarterly review with the operations director to retune confidence thresholds based on actual performance and to add new document types (we are currently piloting Letters of Indemnity).

The most important outcome was cultural. The ops team — who could have been hostile to a project that automated most of their old job — became its biggest advocates, because nobody lost their job, the work that remained was more interesting, and the system listens to them.

06 — Timeline

How the engagement ran.

Our delivery process

01

Discovery & process mapping

2 weeks

Time on the ops floor, document-type taxonomy, field-extraction backlog, ops-team transition planning with leadership.

02

POC on bills of lading

6 weeks

End-to-end pipeline on the highest-volume document type. Demonstrated >95% accuracy on a held-out test set of 500 docs.

03

Production hardening

8 weeks

Review console, confidence routing, monitoring, security review, TMS integration.

04

Phased rollout

10 weeks

BoLs → customs declarations → multilingual invoices, two-week parallel-running per document type.

05

Ops redeployment & handover

4 weeks

Team transitions, runbook authoring, on-call training, performance baselining.

07 — FAQ

What we get asked about this engagement.

Why two LLM providers instead of one?
Operational resilience. Both GPT-4o and Claude 3.5 Sonnet pass the accuracy bar for this workload, but they fail differently — when one degrades or has an outage, the other tends to be fine. The pipeline routes between them transparently, with per-provider fallback chains. No single vendor outage has caused a customer-visible disruption since launch.
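
Roughly, the failover looks like this (stub functions in place of the real provider clients):

```python
from collections.abc import Callable

# Stubs standing in for the real GPT-4o and Claude clients.
def call_gpt4o(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")

def call_claude(prompt: str) -> str:
    return '{"hs_code": "080810"}'

PROVIDER_CHAIN: list[Callable[[str], str]] = [call_gpt4o, call_claude]

def extract_with_fallback(prompt: str) -> str:
    last_error: Exception | None = None
    for provider in PROVIDER_CHAIN:
        try:
            return provider(prompt)  # degraded-output checks also trigger failover
        except Exception as err:
            last_error = err
    raise RuntimeError("all providers failed") from last_error

print(extract_with_fallback("Extract fields from: ..."))  # fails over to Claude
```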
How accurate does the system need to be before automatic processing kicks in?
0.95 confidence across all fields in the document, where the confidence is a calibrated score derived from both the OCR and the LLM extraction. We deliberately set the threshold conservatively at launch — most documents fell into mid-confidence and got reviewed. As the retrieval layer improved with real-world data, the auto-process rate climbed naturally from 60% at launch to the current 95%.
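
One plausible way to express that combination, not our exact calibration: a geometric mean of OCR and LLM confidence per field, with the document gated on its weakest field:

```python
import math

def field_confidence(ocr_conf: float, llm_conf: float) -> float:
    # Geometric mean: both signals must be strong for the field to score high.
    return math.sqrt(ocr_conf * llm_conf)

def document_confidence(fields: dict[str, tuple[float, float]]) -> float:
    # Gate the whole document on its weakest field.
    return min(field_confidence(o, l) for o, l in fields.values())

doc = {"hs_code": (0.99, 0.96), "consignee": (0.98, 0.97)}
print(round(document_confidence(doc), 3))  # 0.975 -> above the 0.95 auto threshold
```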
What happens when the model gets an HS code wrong?
If it gets caught at the confidence threshold (most cases), it goes to ops review and never reaches the customs system. If it slips through (rare), the customs broker's downstream validation catches it and creates a correction event. That correction feeds back into the RAG layer so the same mistake becomes less likely going forward. The system fails forward, not backward.
Could you have done this without the human-in-the-loop console?
Yes, but it would have been a worse system. The console is what makes the pipeline learn from operations in real time — and it is what made the ops team trust the project. A black-box automation would have been politically rejected by the people who knew the work best.

Have a similar problem?

Most engagements start with a 30-minute discovery call. No pitch deck, no NDAs on day one — just an honest conversation about your situation.

Schedule a Call