GenAI · Retrieval-Augmented Generation

Production RAG over 40,000 engineering documents, case study.

An industrial manufacturer needed engineers to find answers across decades of datasheets, GD&T drawings, and internal standards without spending half an hour per question. We built a production RAG system with hybrid search, reranking, and grounded-citation answers — deployed behind their SSO.

By Yantrix Engineering · Applied AI Studio · 3 min read · Industrial manufacturing
Figure: RAG copilot interface showing a grounded answer with cited engineering datasheet sources.

Overview

Why this study matters

How a hybrid-search RAG system over 40k engineering PDFs and CAD drawings cut average engineer-question turnaround from 35 minutes to 22 seconds, with grounded citations on every answer.

Client: A large Indian industrial-equipment manufacturer

Project Type: GenAI Application + RAG

Industry: Industrial manufacturing

Service Used: GenAI + RAG + MLOps

Results in numbers

What the engagement actually shipped.

  • 95× faster than manual search
  • 22 s average answer time
  • 0.93 RAGAS faithfulness score
  • 40k documents indexed
  • 88% inference-cost reduction after fine-tuning

Objectives

What the project needed to achieve

  • Answer engineering questions with grounded citations to the exact source paragraph
  • Handle the mixed-modality corpus (PDFs, scanned drawings, spreadsheets, internal wiki)
  • Cut average question-to-answer time from ~35 minutes to under 30 seconds
  • Deploy behind SSO with role-based access so engineering data stays governed
  • Hit a faithfulness score ≥ 0.90 on the client’s held-out evaluation set

Challenge

Engineering constraint

The client’s 200+ engineers spent significant time searching across a 40,000-document corpus: PDF datasheets, scanned GD&T drawings, internal design standards, ECN history, and supplier specs. Off-the-shelf enterprise search returned files; engineers wanted answers, with the source paragraph cited. The team had tried a vanilla RAG pilot that retrieved well but hallucinated citations, eroding trust within two weeks. They needed a production-grade system, one that would not join the 40–60% of RAG projects commonly reported to fail in production.

Approach

How Yantrix approached the work

  1. Built an ingestion pipeline that parses PDFs (PyMuPDF + Unstructured), runs OCR on scanned GD&T drawings (PaddleOCR with table reconstruction), and chunks documents semantically rather than by fixed length.
  2. Set up hybrid retrieval combining BM25 (Elasticsearch) and dense embeddings (bge-large-en-v1.5 in a Qdrant collection), with a Cohere Rerank-v3 cross-encoder reranking the top 40 hits to the top 8.
  3. Layered an agentic query-decomposition step: GPT-4o (later switched to a fine-tuned Llama 3 70B) decomposes multi-hop questions into sub-queries, retrieves for each, and synthesizes with explicit per-paragraph citations.
  4. Built a RAGAS-based evaluation harness with 240 question-answer pairs from the client’s domain experts; iterated retrieval, reranker, and prompt design until faithfulness and answer-relevancy both crossed 0.90.
  5. Deployed behind the client’s SSO with role-based ACLs on the corpus so each engineer sees only documents they’re cleared for.
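
To make steps 2 and 5 concrete, here is a minimal Python sketch of the selection stage: filter retrieved chunks by role, then rerank the top 40 down to the top 8. It is an illustration under stated assumptions, not the production code. The `Chunk` type, the `allowed_roles` field, and `select_context` are hypothetical names, and `overlap_score` is a toy stand-in for the cross-encoder reranker (the real system used Cohere Rerank-v3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset  # hypothetical ACL field attached at ingestion


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def select_context(query, candidates, user_roles, score=overlap_score,
                   fetch_k=40, final_k=8):
    """Drop chunks the user may not see, then rerank the first fetch_k to final_k."""
    # ACL filter first, so restricted text never reaches the reranker or the LLM.
    visible = [c for c in candidates if c.allowed_roles & user_roles]
    shortlist = visible[:fetch_k]
    ranked = sorted(shortlist, key=lambda c: score(query, c.text), reverse=True)
    return ranked[:final_k]


candidates = [
    Chunk("a", "torque spec for flange bolt", frozenset({"design"})),
    Chunk("b", "flange bolt torque spec 45 Nm", frozenset({"quality"})),
    Chunk("c", "paint colour codes", frozenset({"design"})),
]
top = select_context("flange bolt torque", candidates, {"design"}, final_k=1)
print([c.doc_id for c in top])
```

Filtering before reranking is one reasonable ordering: it keeps restricted content out of every downstream component. In a real deployment the same filter would more likely be pushed into the retrieval query itself.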

Outcomes

What improved by the end

  • Average answer time: 22 seconds versus 35 minutes baseline — ~95× faster
  • RAGAS faithfulness 0.93, answer-relevancy 0.91 on held-out evaluation set
  • Every answer carries source citations; engineer trust was restored within 6 weeks of the pilot
  • Hosted on the client’s own infrastructure with role-scoped access controls
  • Switched from GPT-4o to fine-tuned Llama 3 70B at month 5 to cut inference cost ~88%
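
For readers checking the headline figure, the ~95× speedup follows directly from the two times quoted above. The snippet only restates that arithmetic; the variable names are illustrative.

```python
baseline_seconds = 35 * 60   # 35-minute manual search baseline
rag_seconds = 22             # average answer time after deployment

speedup = baseline_seconds / rag_seconds
print(f"{speedup:.1f}x")     # 95.5x
```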

Deliverables

What the client receives

  • Ingestion pipeline with PDF + OCR + table reconstruction
  • Hybrid retrieval stack (BM25 + dense + reranker) with documented tuning
  • Agentic query-decomposition + grounded-citation generation
  • RAGAS evaluation harness with the client’s 240-pair domain benchmark
  • Fine-tuned Llama 3 70B adapter + vLLM serving configuration
  • SSO integration and role-based corpus ACL
  • Monitoring dashboard tracking retrieval recall@k, faithfulness drift, latency, cost-per-query
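
Of the dashboard metrics, retrieval recall@k is the easiest to pin down in code. The sketch below shows one common definition: the share of known-relevant chunks that appear in the top k results. It assumes each benchmark question is labeled with its relevant chunk IDs; the function names and sample IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# One question: two relevant chunks, only one of them in the top 2.
print(recall_at_k(["a", "b", "c", "d"], ["b", "d"], k=2))
```

Tracking the mean over a fixed question set after each re-index is one way to spot retrieval drift before it shows up in faithfulness scores.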

Tools used

Stack and tooling

  • LangChain + LlamaIndex for orchestration
  • Qdrant for the vector index, Elasticsearch for BM25
  • bge-large-en-v1.5 embeddings + Cohere Rerank-v3
  • GPT-4o then Llama 3 70B (fine-tuned, self-hosted on vLLM)
  • RAGAS for evaluation
  • PyMuPDF, Unstructured, PaddleOCR for ingestion
  • FastAPI backend behind the client’s SSO

Impact

Business-level effect

  • Engineering team time freed equivalent to ~6 FTE per quarter
  • Decisions across the team converged faster because they referenced the same cited paragraphs
  • Documentation gaps surfaced: the team is now backfilling standards for questions the RAG system couldn’t answer

Conclusion

Production RAG is mostly about the ingestion pipeline, the retrieval stack, and the evaluation harness. The model picks itself once those three are right. We use the same architecture template across knowledge-base RAG engagements; only the corpus and the eval set change.

Next step

Have a corpus your team spends hours searching every week? Send us the document mix and a short list of representative questions — we’ll come back with a RAG scope and a faithfulness target.

Tagged

  • RAG
  • GenAI
  • Hybrid Search
  • Llama 3
  • Fine-Tuning
  • Enterprise AI

Frequently asked questions

Answers from the engagement itself.

Why do most RAG projects fail to reach production?

Three usual causes: (1) retrieval quality is treated as a one-shot setup instead of an iterative loop, so recall stays mediocre; (2) the system can’t cite sources, so users lose trust the first time it hallucinates; (3) there’s no evaluation harness, so the team can’t tell whether a prompt or retrieval change is an improvement. Every engagement of ours starts with the eval harness.
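
One way to use an eval harness for point (3) is as a gate: a prompt or retrieval change ships only if every metric stays above a floor and does not regress against the current baseline. The sketch below assumes the metric scores have already been computed (for example by a RAGAS run). `gate_change`, the 0.01 tolerance, and the candidate scores are hypothetical; the baseline values are the ones reported for this engagement.

```python
def gate_change(baseline, candidate, floor=0.90, tolerance=0.01):
    """Return (ok, reasons); ok is False if any metric falls below floor or regresses."""
    reasons = []
    for metric, base_score in baseline.items():
        new_score = candidate[metric]
        if new_score < floor:
            reasons.append(f"{metric} {new_score:.2f} is below floor {floor:.2f}")
        elif new_score < base_score - tolerance:
            reasons.append(f"{metric} regressed from {base_score:.2f} to {new_score:.2f}")
    return (not reasons, reasons)


baseline = {"faithfulness": 0.93, "answer_relevancy": 0.91}
candidate = {"faithfulness": 0.94, "answer_relevancy": 0.88}  # hypothetical run
ok, reasons = gate_change(baseline, candidate)
print(ok, reasons)
```

In this example the candidate improves faithfulness but drops answer-relevancy below the floor, so the change is rejected.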

Hybrid search or pure semantic search — which is right for engineering documents?

Hybrid almost always wins. Engineering questions mix proper-noun part numbers (where BM25 dominates) with conceptual queries (where dense embeddings dominate). Running both and reranking with a cross-encoder typically picks up 8–15 percentage points of recall@10 over either alone.
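
The answer above does not say how the BM25 and dense result lists are merged before reranking. Reciprocal rank fusion (RRF) is one widely used option, shown here as an assumption rather than as the method used in this engagement. The document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["ecn-114", "std-07", "ds-220"]    # exact part-number match ranks first
dense_hits = ["std-07", "ds-301", "ecn-114"]   # conceptual match ranks first
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Documents that both retrievers rank highly rise to the top of the fused list, which is then passed to the cross-encoder reranker.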

When should I fine-tune the generator instead of just using GPT-4o or Claude?

Switch to a fine-tuned open-weight model when inference cost or data-residency becomes the binding constraint. For this engagement we moved from GPT-4o to fine-tuned Llama 3 70B at month 5 — cost dropped ~88% with no measurable quality loss on the eval set.

How much does a production RAG engagement cost in India in 2026?

Pilot (one corpus, one use case, basic evaluation): ₹4–9 lakh, 6–10 weeks. Production deployment with evaluation harness, SSO integration, and MLOps: ₹15–35 lakh, 4–6 months. Pricing varies with corpus complexity, OCR needs, and the depth of the eval set.
