
LoRA-tuned CLIP for industrial part recognition on a single GPU, case study.
An industrial distributor needed visual recognition across 6,200 SKUs without the cost of training a full vision model. We LoRA-tuned OpenCLIP on warehouse photos and shipped an 84 MB adapter that runs inference at production scale on commodity GPUs.

Overview
Why this study matters
Parameter-efficient fine-tuning of OpenCLIP ViT-L/14 with LoRA adapters on 18,000 SKU photos reached 97.4% top-1 accuracy, versus 78.1% zero-shot, after 11 hours of training on a single RTX 4090.
Client: An Indian industrial parts distributor with 6,200 SKUs
Project Type: Model Tuning + Computer Vision
Industry: Industrial distribution / wholesale
Service Used: LLM / VLM Fine-Tuning + LoRA + PEFT
Results in numbers
What the engagement actually shipped.
- 97.4%: Top-1 accuracy
- +19.3 pp: Lift over zero-shot baseline
- 84 MB: Adapter size (vs 1.7 GB full FT)
- 11 h: Training on 1 × RTX 4090
- ₹1.4 L: Total training cost
Objectives
What the project needed to achieve
- Hit ≥ 95% top-1 accuracy across 6,200 SKUs from warehouse photographs
- Keep training cost under ₹2 lakh end-to-end
- Ship an adapter small enough to deploy on commodity inference GPUs (T4, L4, RTX 4090)
- Make the fine-tuning pipeline reproducible so the client can extend it as the catalog grows
Challenge
Engineering constraint
The client needed visual SKU recognition for receiving, kitting, and stock-counting workflows but couldn’t justify the GPU budget for training a custom vision model from scratch (an estimated ₹45 lakh in cloud compute alone for a dataset of this size). Off-the-shelf CLIP reached 78.1% zero-shot top-1 accuracy across the 6,200-SKU catalog, which was useful but not deployable. They needed to close the gap to at least 95% while keeping training cost in the ₹1–2 lakh range.
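For context, zero-shot CLIP classification generally scores an image embedding against one text embedding per class and picks the closest match. The sketch below illustrates that idea and the top-1 accuracy metric with toy 3-dimensional vectors. The embeddings, SKU names, and helper names are hypothetical; a real pipeline would take its embeddings from the OpenCLIP image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, text_embs):
    """Return the class whose text embedding is most similar to the image."""
    return max(text_embs, key=lambda sku: cosine(image_emb, text_embs[sku]))

def top1_accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy data (hypothetical): one text embedding per SKU prompt.
text_embs = {
    "hex-bolt-m8": [1.0, 0.1, 0.0],
    "flange-nut-m8": [0.1, 1.0, 0.0],
    "washer-m8": [0.0, 0.1, 1.0],
}
# (image embedding, true label) pairs; the third photo is deliberately ambiguous.
images = [
    ([0.9, 0.2, 0.1], "hex-bolt-m8"),
    ([0.2, 0.8, 0.1], "flange-nut-m8"),
    ([0.6, 0.5, 0.1], "flange-nut-m8"),
    ([0.1, 0.0, 0.9], "washer-m8"),
]
preds = [zero_shot_predict(emb, text_embs) for emb, _ in images]
labels = [label for _, label in images]
accuracy = top1_accuracy(preds, labels)
print(f"top-1 accuracy: {accuracy:.2%}")
```

In this toy set the ambiguous third photo lands closer to the wrong class, which is the kind of visually similar SKU confusion that fine-tuning is meant to reduce.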
Approach
How Yantrix approached the work
- 01 Curated a labeled dataset of 18,000 warehouse photographs spanning the SKU catalog: a mix of receiving-dock photos and existing product-listing images, with active learning used to focus labeling effort on the SKUs where zero-shot CLIP struggled.
- 02 Applied LoRA adapters (r=16, alpha=32) to the OpenCLIP ViT-L/14 visual encoder with the text encoder frozen, a standard PEFT pattern for vision-language adaptation.
- 03 Trained on a single RTX 4090 in BF16 mixed precision for 11 hours over 18 epochs, with a cosine learning-rate schedule and a held-out test set per SKU class.
- 04 Compared against full fine-tuning (1.7 GB checkpoint, 4× the training time, 2× the GPU memory) and DoRA (slightly better accuracy at the same parameter count); shipped LoRA for its operational simplicity.
- 05 Built a serving setup using Predibase’s multi-adapter pattern so the same base CLIP model can serve multiple LoRA adapters (e.g. one per warehouse) on a single GPU without per-adapter deployment cost.
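To make the LoRA step above concrete, here is a minimal pure-Python sketch of the standard LoRA update, y = W x + (alpha / r) · B (A x), along with the parameter arithmetic for one linear layer. It is an illustration of the technique and not the engagement's training code. The helper names and the toy matrices are hypothetical, and the 1024-wide projection is an assumption based on the usual ViT-L hidden width.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, alpha):
    """Compute y = W x + (alpha / r) * B (A x), with W kept frozen."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)

# Tiny worked example: a 2x2 frozen weight with a rank-1 adapter.
# alpha / r = 2 here, the same ratio as the study's alpha=32, r=16.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out x d_in)
A = [[1.0, 1.0]]               # r x d_in, trained
B = [[0.5], [0.0]]             # d_out x r, trained (usually initialised to zero)
y = lora_forward(W, A, B, x=[2.0, 4.0], r=1, alpha=2)

# Parameter arithmetic for one assumed 1024 x 1024 projection at r=16.
added = lora_param_count(1024, 1024, r=16)
full = 1024 * 1024
print(y, added, f"{added / full:.3%}")
```

Under these assumptions the adapter trains about 3% of the parameters in each adapted projection. The total adapter size depends on which layers are targeted, which the write-up does not specify.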
Outcomes
What improved by the end
- 97.4% top-1 accuracy across 6,200 SKUs (78.1% zero-shot baseline)
- Training cost ₹1.4 lakh end-to-end — ₹1.1 lakh on GPU rental + ₹0.3 lakh on data labeling
- Adapter size 84 MB versus 1.7 GB full fine-tune — ships over 4G in seconds
- 11 hours of single-GPU training versus 44 hours full FT
- Multi-adapter serving lets future warehouse-specific tuning ride on the same base GPU
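The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above, plus one assumption that is not from the study: a sustained 4G downlink of 20 Mbit/s, chosen purely to illustrate transfer time.

```python
# Figures quoted in the case study.
gpu_rental_lakh = 1.1
labeling_lakh = 0.3
adapter_mb = 84
full_ft_mb = 1.7 * 1000       # 1.7 GB expressed in decimal megabytes
lora_hours = 11
full_ft_hours = 44

# Assumption (not from the study): sustained 4G downlink of 20 Mbit/s.
assumed_4g_mbit_per_s = 20

def transfer_seconds(size_mb, mbit_per_s):
    """Seconds to move size_mb megabytes over a link of mbit_per_s."""
    return size_mb * 8 / mbit_per_s

total_cost_lakh = round(gpu_rental_lakh + labeling_lakh, 2)
size_ratio = adapter_mb / full_ft_mb
speedup = full_ft_hours / lora_hours
adapter_s = transfer_seconds(adapter_mb, assumed_4g_mbit_per_s)
full_s = transfer_seconds(full_ft_mb, assumed_4g_mbit_per_s)

print(f"total cost: {total_cost_lakh} lakh")
print(f"adapter is {size_ratio:.1%} of the full checkpoint")
print(f"full fine-tune takes {speedup:.0f}x as long")
print(f"transfer: adapter {adapter_s:.0f} s, full checkpoint {full_s:.0f} s")
```

At the assumed link speed the adapter transfers in roughly half a minute, while the full checkpoint takes more than ten minutes.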
Deliverables
What the client receives
- Trained LoRA adapter for OpenCLIP ViT-L/14
- Reproducible fine-tuning pipeline (Axolotl config + Weights & Biases run)
- Evaluation report comparing LoRA, DoRA, and full fine-tune
- Multi-adapter serving configuration on Predibase LoRAX-style infrastructure
- Data-curation playbook so the client can retrain as the catalog evolves
Tools used
Stack and tooling
- OpenCLIP ViT-L/14
- Hugging Face PEFT library for LoRA
- Axolotl for fine-tuning orchestration
- Weights & Biases for experiment tracking
- RTX 4090 (single GPU)
- Predibase LoRAX-style multi-adapter serving pattern
Impact
Business-level effect
- Stock-count workflow throughput up by ~3.4× vs. manual identification
- Receiving-dock mis-classification incidents down to a handful per month
- Pattern is now the client’s default for vision-model adaptation across other use cases
Conclusion
Full fine-tuning is overkill for most domain-adaptation problems in 2026. LoRA, QLoRA, and DoRA hit the same quality bar at 5–10% of the cost — the right default for production vision-language tuning.
Next step
Have a vision or LLM use case where zero-shot accuracy gets you 70–80% of the way but not quite to production? LoRA adaptation usually closes that gap at a fraction of full-FT cost.
Tagged
- LoRA
- PEFT
- CLIP
- Fine-Tuning
- Vision-Language Model
- Predibase
Visual results
Key views and intermediate artefacts


Figure: adapter size versus full fine-tune checkpoint size (84 MB vs 1.7 GB)
Frequently asked questions
Answers from the engagement itself.
LoRA vs QLoRA vs DoRA — which should I use in 2026?
LoRA is the safe default for most adaptation tasks. QLoRA quantizes the base model, which is essential when you can’t fit it on the GPU otherwise (Llama 3 70B on a single A100, for example). DoRA edges out LoRA on benchmarks by 1–2 percentage points but is more complex to tune; reach for it when LoRA plateaus.
What hardware do I need for LoRA fine-tuning of a vision model?
A single RTX 4090 (24 GB) handles LoRA on ViT-L and smaller VLMs. ViT-H or 7B-class VLMs (Qwen2-VL-7B, LLaVA) need an A6000 or A100. QLoRA pushes 70B-class models onto a single A100. The era of needing an 8×H100 cluster for adaptation is over.
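A rough way to reason about these hardware tiers is to estimate weight storage alone at different precisions. The sketch below does that arithmetic. The parameter counts are assumed round figures, not numbers from the engagement, and the estimate ignores activations, gradients, optimizer state, and quantization overhead, all of which add to real memory use.

```python
def weight_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal GB, ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed round parameter counts (hypothetical, for illustration only).
models = {
    "CLIP ViT-L/14 (~0.43B)": 0.43e9,
    "7B-class VLM": 7e9,
    "70B-class LLM": 70e9,
}

for name, n in models.items():
    fp32 = weight_gb(n, 32)
    bf16 = weight_gb(n, 16)
    int4 = weight_gb(n, 4)
    print(f"{name}: fp32 {fp32:.2f} GB, bf16 {bf16:.2f} GB, 4-bit {int4:.2f} GB")

clip_fp32_gb = weight_gb(0.43e9, 32)
llm_70b_bf16_gb = weight_gb(70e9, 16)
llm_70b_4bit_gb = weight_gb(70e9, 4)
```

Under these assumptions a 70B model needs about 140 GB for BF16 weights but about 35 GB at 4 bits, which is why quantizing the base model is what brings it within reach of a single 80 GB A100.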
Can I serve many LoRA adapters from the same base GPU?
Yes — Predibase’s LoRAX framework and vLLM’s multi-LoRA support both serve dozens of adapters on one GPU deployment, routing requests at runtime. This is the cost unlock for task-specific or customer-specific tuning: one base model, many adapters, one GPU bill.
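The routing idea can be sketched in a few lines: the base weights are loaded once, and each request names the adapter whose low-rank delta should be applied. This is a toy illustration only. It does not use the LoRAX or vLLM APIs, and the class, method, and adapter names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class MultiAdapterServer:
    """Toy server: one shared base weight, many small per-tenant adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight   # loaded once, shared by all requests
        self.adapters = {}               # adapter_id -> (A, B, scale)

    def register(self, adapter_id, A, B, scale):
        """Store a low-rank adapter under a request-routable id."""
        self.adapters[adapter_id] = (A, B, scale)

    def forward(self, x, adapter_id=None):
        """Run the base layer, then add the chosen adapter's delta if any."""
        y = matvec(self.base_weight, x)
        if adapter_id is None:
            return y
        A, B, scale = self.adapters[adapter_id]  # raises KeyError if unknown
        delta = matvec(B, matvec(A, x))
        return [b + scale * d for b, d in zip(y, delta)]

server = MultiAdapterServer(base_weight=[[1.0, 0.0], [0.0, 1.0]])
server.register("warehouse-a", A=[[1.0, 0.0]], B=[[1.0], [0.0]], scale=2.0)
server.register("warehouse-b", A=[[0.0, 1.0]], B=[[0.0], [1.0]], scale=2.0)

x = [1.0, 3.0]
out_base = server.forward(x)
out_a = server.forward(x, "warehouse-a")
out_b = server.forward(x, "warehouse-b")
print(out_base, out_a, out_b)
```

Because each adapter stores only the small A and B matrices, adding another warehouse costs a little extra memory and leaves the shared base model untouched.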
What does a vision-language LoRA fine-tuning engagement cost in India?
A typical engagement covering data curation, LoRA training, serving setup, and handover costs ₹8–22 lakh and takes 8–14 weeks. GPU spend for the training itself is usually ₹1–3 lakh, rented on a single high-end consumer or workstation GPU.