GenAI · Retrieval-Augmented Generation

Production RAG over 40,000 engineering documents, case study.

An industrial manufacturer needed engineers to find answers across decades of datasheets, GD&T drawings, and internal standards without spending half an hour per question. We built a production RAG system with hybrid search, reranking, and grounded-citation answers — deployed behind their SSO.

By Yantrix Engineering · Applied AI Studio · Published May 8, 2026 · Updated May 12, 2026 · 3 min read · Industrial manufacturing

Figure: RAG copilot interface showing a grounded answer with cited engineering datasheet sources

Overview

Why this study matters

How a hybrid-search RAG system over 40k engineering PDFs and CAD drawings cut average engineer-question turnaround from 35 minutes to 22 seconds, with grounded citations on every answer.

Client: A large Indian industrial-equipment manufacturer

Project Type: GenAI Application + RAG

Industry: Industrial manufacturing

Service Used: GenAI + RAG + MLOps

Results in numbers

What the engagement actually shipped.

95×: Faster than manual search
22 s: Average answer time
0.93: RAGAS faithfulness score
40k: Documents indexed
88%: Inference-cost reduction post-FT
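
The 95× figure follows directly from the two timings reported in this study. A quick check of the arithmetic, with variable names chosen only for illustration:

```python
# Timings as reported in this case study.
baseline_seconds = 35 * 60   # roughly 35 minutes of manual search per question
rag_seconds = 22             # average answer time with the RAG system

speedup = baseline_seconds / rag_seconds
print(f"speedup is about {speedup:.0f}x")  # prints "speedup is about 95x"
```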

Objectives

What the project needed to achieve

Answer engineering questions with grounded citations to the exact source paragraph
Handle the mixed-modality corpus (PDFs, scanned drawings, spreadsheets, internal wiki)
Cut average question-to-answer time from ~35 minutes to under 30 seconds
Deploy behind SSO with role-based access so engineering data stays governed
Hit a faithfulness score ≥ 0.90 on the client’s held-out evaluation set

Challenge

Engineering constraint

The client’s 200+ engineers spent significant time searching across a 40,000-document corpus: PDF datasheets, scanned GD&T drawings, internal design standards, ECN history, and supplier specs. Off-the-shelf enterprise search returned files; engineers wanted answers, with the source paragraph cited. The team had tried a vanilla RAG pilot that retrieved well but hallucinated citations, eroding trust within two weeks. They needed a production-grade system that would not join the 40–60% of RAG projects that reportedly fail in production.

Approach

How Yantrix approached the work

01. Built an ingestion pipeline that parses PDFs (PyMuPDF + Unstructured), runs OCR on scanned GD&T drawings (PaddleOCR with table reconstruction), and chunks documents semantically rather than by fixed length.
02. Set up hybrid retrieval combining BM25 (Elasticsearch) and dense embeddings (bge-large-en-v1.5 in a Qdrant collection), with a Cohere Rerank-v3 cross-encoder reranking the top 40 hits to the top 8.
03. Layered an agentic query-decomposition step: GPT-4o (later switched to a fine-tuned Llama 3 70B) decomposes multi-hop questions into sub-queries, retrieves for each, and synthesizes with explicit per-paragraph citations.
04. Built a RAGAS-based evaluation harness with 240 question-answer pairs from the client’s domain experts; iterated retrieval, reranker, and prompt design until faithfulness and answer-relevancy both crossed 0.90.
05. Deployed behind the client’s SSO with role-based ACLs on the corpus so each engineer sees only documents they’re cleared for.
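
The retrieval funnel in the steps above (role-scoped candidates, a first-stage pool of 40, a reranked top 8) can be sketched roughly as follows. This is a minimal illustration, not the deployed code: the `Chunk` fields, the `acl_filter` and `rerank` helpers, and the toy `overlap_score` function are all hypothetical stand-ins, and the scorer simply counts shared tokens where the real system calls a cross-encoder reranker. The sample chunks are made-up data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    chunk_id: str                  # hypothetical "document#paragraph" identifier
    text: str
    allowed_roles: frozenset[str]  # hypothetical ACL field stored with each chunk


def acl_filter(chunks: list[Chunk], user_roles: set[str]) -> list[Chunk]:
    """Keep only chunks the user is cleared for (at least one shared role)."""
    return [c for c in chunks if c.allowed_roles & user_roles]


def rerank(
    query: str,
    candidates: list[Chunk],
    score: Callable[[str, str], float],
    first_stage_k: int = 40,
    final_k: int = 8,
) -> list[Chunk]:
    """Score the first-stage pool against the query and keep the best few."""
    pool = candidates[:first_stage_k]
    ranked = sorted(pool, key=lambda c: score(query, c.text), reverse=True)
    return ranked[:final_k]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the text."""
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(text.lower().split())) / max(len(q_tokens), 1)


# Made-up chunks, for illustration only.
chunks = [
    Chunk("std-014#p1", "surface finish requirements for sealing faces",
          frozenset({"design"})),
    Chunk("ds-001#p3", "maximum torque for the M12 flange bolt is 85 Nm",
          frozenset({"design", "quality"})),
    Chunk("ecn-220#p2", "torque value for M12 bolt revised after supplier change",
          frozenset({"procurement"})),
]

visible = acl_filter(chunks, {"design"})
top = rerank("M12 bolt torque", visible, overlap_score, final_k=2)
print([c.chunk_id for c in top])
```

Filtering by role before reranking means a chunk the user is not cleared for never reaches the generator, so it cannot leak into an answer or a citation.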

Outcomes

What improved by the end

Average answer time: 22 seconds versus a 35-minute baseline (~95× faster)
RAGAS faithfulness 0.93 and answer-relevancy 0.91 on the held-out evaluation set
Every answer cites its sources; trust was restored within 6 weeks of the pilot
Hosted on the client’s own infrastructure with role-scoped access controls
Switched from GPT-4o to fine-tuned Llama 3 70B at month 5 to cut inference cost ~88%
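
One cheap guard against the hallucinated citations that sank the earlier pilot is to verify that every citation in a generated answer points at a chunk that was actually retrieved. The sketch below assumes a bracketed `[document#paragraph]` citation format and a `check_citations` helper, both hypothetical; the case study does not describe how citations are encoded.

```python
import re

# Assumed citation format: square brackets around a chunk identifier.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Report cited chunk IDs and flag any that were never retrieved."""
    cited = set(CITATION.findall(answer))
    unknown = cited - retrieved_ids
    return {
        "cited": sorted(cited),
        "unknown": sorted(unknown),
        # Grounded only if there is at least one citation and none is invented.
        "grounded": bool(cited) and not unknown,
    }


# Made-up answer and retrieval set, for illustration only.
retrieved = {"ds-001#p3", "std-014#p1"}
answer = (
    "The M12 flange bolt torque is 85 Nm [ds-001#p3]. "
    "Sealing faces need a fine finish [std-999#p4]."
)
report = check_citations(answer, retrieved)
print(report)
```

This only proves that a citation refers to retrieved material. Whether the cited paragraph supports the claim is a separate question, which is what the faithfulness metric measures.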

Deliverables

What the client receives

Ingestion pipeline with PDF + OCR + table reconstruction
Hybrid retrieval stack (BM25 + dense + reranker) with documented tuning
Agentic query-decomposition + grounded-citation generation
RAGAS evaluation harness with the client’s 240-pair domain benchmark
Fine-tuned Llama 3 70B adapter + vLLM serving configuration
SSO integration and role-based corpus ACL
Monitoring dashboard tracking retrieval recall@k, faithfulness drift, latency, cost-per-query
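
Of the dashboard metrics listed, recall@k is the simplest to compute from labelled queries. A minimal sketch, assuming each benchmark question comes with a set of relevant chunk IDs (the function names and sample IDs are illustrative, not taken from the delivered harness):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("recall is undefined without at least one relevant id")
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Made-up retrieval results for two queries.
runs = [
    (["a", "b", "c", "d"], {"b", "d"}),   # one of two relevant chunks in the top 3
    (["x", "y", "z"], {"x"}),             # the only relevant chunk is ranked first
]
print(mean_recall_at_k(runs, k=3))  # prints 0.75
```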

Tools used

Stack and tooling

LangChain + LlamaIndex for orchestration
Qdrant for the vector index, Elasticsearch for BM25
bge-large-en-v1.5 embeddings + Cohere Rerank-v3
GPT-4o then Llama 3 70B (fine-tuned, self-hosted on vLLM)
RAGAS for evaluation
PyMuPDF, Unstructured, PaddleOCR for ingestion
FastAPI backend behind the client’s SSO

Impact

Business-level effect

Freed engineering time equivalent to ~6 FTE per quarter
Decisions across the team converged faster because they referenced the same cited paragraphs
Documentation gaps surfaced — the team is now backfilling standards the RAG couldn’t answer

Conclusion

Production RAG is mostly about the ingestion pipeline, the retrieval stack, and the evaluation harness. The model picks itself once those three are right. We use the same architecture template across knowledge-base RAG engagements; only the corpus and the eval set change.

Next step

Have a corpus your team spends hours searching every week? Send us the document mix and a short list of representative questions — we’ll come back with a RAG scope and a faithfulness target.

Tagged

RAG
GenAI
Hybrid Search
Llama 3
Fine-Tuning
Enterprise AI

Visual results

Key views and intermediate artefacts

Figure: Hybrid-search retrieval pipeline (BM25, dense embeddings, and reranker over engineering documents)

Figure: Grounded-citation answer view (answer linked to the source datasheet paragraph)

Frequently asked questions

Answers from the engagement itself.

Why do most RAG projects fail to reach production?

Three usual causes: (1) retrieval quality is treated as a one-shot setup instead of an iterative loop, so recall stays mediocre; (2) the system can’t cite sources, so users lose trust the first time it hallucinates; (3) there’s no evaluation harness, so the team can’t tell whether a prompt or retrieval change is an improvement. Every engagement of ours starts with the eval harness.
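
To make the third point concrete: once an evaluation harness produces scores, a prompt or retrieval change can be accepted or rejected mechanically. The sketch below assumes metric scores arrive as plain dictionaries. The 0.90 floors and the 0.93 / 0.91 baseline come from this study, while the `regression_gate` helper and the 0.01 tolerance are illustrative assumptions.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    floors: dict[str, float],
    tolerance: float = 0.01,   # assumed allowance for run-to-run noise
) -> list[str]:
    """Return the reasons a candidate configuration should be rejected, if any."""
    failures = []
    for metric, floor in floors.items():
        value = candidate[metric]
        if value < floor:
            failures.append(f"{metric} {value:.2f} is below the floor {floor:.2f}")
        elif value < baseline[metric] - tolerance:
            failures.append(f"{metric} {value:.2f} regressed from {baseline[metric]:.2f}")
    return failures


floors = {"faithfulness": 0.90, "answer_relevancy": 0.90}
baseline = {"faithfulness": 0.93, "answer_relevancy": 0.91}

# Hypothetical candidate: a new prompt that improves one metric and hurts the other.
candidate = {"faithfulness": 0.94, "answer_relevancy": 0.89}
print(regression_gate(baseline, candidate, floors))
```

Run on every change, a gate like this turns "does it feel better?" into a yes-or-no answer against the same benchmark.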

Hybrid search or pure semantic search — which is right for engineering documents?

Hybrid almost always wins. Engineering questions mix exact-match lookups such as part numbers (where BM25 dominates) with conceptual queries (where dense embeddings dominate). Running both and reranking the merged candidates with a cross-encoder typically picks up 8–15 percentage points of recall@10 over either alone.
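
The case study does not say how the BM25 and dense result lists are merged before reranking. One common choice is reciprocal rank fusion (RRF), shown here as an assumption rather than the method used in this engagement. It needs only the rank of each document in each list, so BM25 scores and cosine similarities never have to be put on the same scale. The constant `k = 60` is the conventional default, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["pn-4471", "ds-001", "std-014"]    # keyword match on a part number
dense_hits = ["std-014", "ds-001", "wiki-88"]   # semantic match on the concept

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while a part number found only by BM25 still survives into the candidate pool for the cross-encoder to judge.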

When should I fine-tune the generator instead of just using GPT-4o or Claude?

Switch to a fine-tuned open-weight model when inference cost or data-residency becomes the binding constraint. For this engagement we moved from GPT-4o to fine-tuned Llama 3 70B at month 5 — cost dropped ~88% with no measurable quality loss on the eval set.

How much does a production RAG engagement cost in India in 2026?

Pilot (one corpus, one use case, basic evaluation): ₹4–9 lakh, 6–10 weeks. Production deployment with evaluation harness, SSO integration, and MLOps: ₹15–35 lakh, 4–6 months. Pricing varies with corpus complexity, OCR needs, and the depth of the eval set.

Related case studies

Adjacent proof you can read next.

Model Tuning · LoRA / PEFT

LoRA-tuned CLIP for industrial part recognition on a single GPU

Parameter-efficient fine-tuning of OpenCLIP ViT-L/14 with LoRA adapters on 18,000 SKU photos — 97.4% accuracy versus 78.1% zero-shot, 11 hours on a single RTX 4090.

MLOps · Edge AI Fleet

MLOps platform for a 600-device edge AI fleet

End-to-end MLOps platform managing 600 Jetson inspection cameras across 14 sites — median model deploy went from 9 days to 38 minutes, with automatic drift-triggered rollback.
