
MLOps platform for a 600-device edge AI fleet, case study.
An industrial client had 600 Jetson-deployed inspection cameras across 14 plants and no operational discipline around them — model deploys were manual, drift went undetected, and a bad model meant 14 plant visits. We built an MLOps platform that operates the fleet like a software service.

Overview
Why this study matters
End-to-end MLOps platform managing 600 Jetson inspection cameras across 14 sites — median model deploy went from 9 days to 38 minutes, with automatic drift-triggered rollback.
Client: A multi-plant Indian manufacturing group operating 14 facilities
Project Type: MLOps + Edge AI Operations
Industry: Multi-plant manufacturing
Service Used: MLOps + Edge AI + Platform Engineering
Results in numbers
What the engagement actually shipped.
- 38 min: Median model deploy
- 9 days: Old baseline deploy time
- 600: Jetson devices managed
- 14: Sites under one platform
- NPS 62: Operator satisfaction
Objectives
What the project needed to achieve
- Manage 600+ Jetson devices across 14 sites as one operated fleet
- Sign + verify model artifacts so only trusted models reach the field
- Cut median time-to-deploy a new model from 9 days to under 1 hour
- Detect accuracy drift automatically and roll back without human intervention
- Give plant operators a Foxglove-style dashboard rather than a black box
Challenge
Engineering constraint
The client had spent two years deploying 600 inspection cameras across 14 plants, but operating them as a fleet was a manual nightmare. Each model update required physically visiting plants. Accuracy drift caused by a changing SKU mix went undetected until line operators escalated. A bad model that slipped through testing meant a 2-week recall across 14 sites. The team needed an operational platform that treated 600 edge devices as a managed service, not 600 individual deployments.
Approach
How Yantrix approached the work
- 01 Stood up an MLflow registry as the single source of truth for trained models, with strict promotion gates from staging → canary → production tied to evaluation-set metrics.
- 02 Built a signed-OTA rollout system: every model artifact is signed at promotion, devices verify the signature before loading, and the rollout pipeline targets canary devices first (5% of fleet) before expanding.
- 03 Implemented per-device drift detection: each camera logs feature-distribution statistics on a rolling window; the platform flags drift > 2σ from the canary baseline and either alerts or auto-rolls back depending on policy.
- 04 Built a fleet dashboard (FastAPI + React + Grafana) showing per-device latency, accuracy proxy, last deploy timestamp, drift status, and a one-click rollback action for plant managers.
- 05 Wrote runbooks and ran two operations bootcamps so the client’s plant IT teams could operate the platform without a Yantrix engineer on call.
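The canary-first rollout in step 02 can be illustrated with a small sketch. It is not the production pipeline: the function names, the device ID format, and the intermediate 25% stage are assumptions made for illustration; only the 5% canary fraction comes from the engagement. The idea shown is deterministic cohort selection, where each device hashes into a stable bucket so that widening the rollout always includes the devices already updated.

```python
import hashlib

# Rollout stages as fractions of the fleet. Only the 5% canary stage is taken
# from the case study; the 25% step is a hypothetical intermediate stage.
STAGES = (0.05, 0.25, 1.0)

BUCKETS = 10_000  # bucket resolution (basis points of the fleet)


def rollout_bucket(device_id: str, model_version: str) -> int:
    """Map a device to a stable bucket in [0, BUCKETS) for one model version."""
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def select_cohort(device_ids, model_version: str, fraction: float):
    """Return the devices whose bucket falls below the rollout fraction.

    Because buckets are stable, the cohort for a larger fraction is always
    a superset of the cohort for a smaller one.
    """
    cutoff = round(fraction * BUCKETS)
    return [d for d in device_ids if rollout_bucket(d, model_version) < cutoff]


if __name__ == "__main__":
    # Hypothetical device naming; real fleet IDs will differ.
    fleet = [f"jetson-{i:04d}" for i in range(600)]
    for fraction in STAGES:
        cohort = select_cohort(fleet, "v42", fraction)
        print(f"{fraction:.0%} stage targets {len(cohort)} devices")
```

Hashing the model version together with the device ID means a different subset of devices acts as canary for each release. A fleet that prefers a fixed canary group would hash the device ID alone or keep an explicit allow-list.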
Outcomes
What improved by the end
- Median model deploy: 9 days → 38 minutes
- 5× canary deploys per week with zero unannounced production-wide rollouts
- Drift-triggered automatic rollback rate: ~1.2 per month across the fleet (none escalated to operator complaint)
- Plant-IT operating cost reduced — no more 14-plant visits per deploy
- Operator dashboard satisfaction tracked at NPS 62 across 14 sites
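The drift-triggered rollbacks counted above follow the 2σ rule from the approach. A minimal sketch of that rule, under stated assumptions, looks like the following. The case study does not say which statistic is tracked or how σ is estimated, so this version assumes a single scalar feature statistic per frame and a baseline mean and standard deviation measured on the canary devices. The class names, the window size, and the returned status strings are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Baseline:
    """Canary baseline for one feature statistic (assumed to be a scalar)."""
    mean: float
    std: float


class DriftMonitor:
    """Flags drift when the rolling mean departs from the baseline mean
    by more than `threshold_sigma` baseline standard deviations."""

    def __init__(self, baseline: Baseline, window: int = 200,
                 threshold_sigma: float = 2.0, policy: str = "alert"):
        if baseline.std <= 0:
            raise ValueError("baseline std must be positive")
        if policy not in ("alert", "rollback"):
            raise ValueError("policy must be 'alert' or 'rollback'")
        self.baseline = baseline
        self.window = window
        self.threshold_sigma = threshold_sigma
        self.policy = policy
        self.samples = deque(maxlen=window)  # oldest samples fall off

    def observe(self, value: float) -> str:
        """Record one sample; return 'warming_up', 'ok', or the policy action."""
        self.samples.append(value)
        if len(self.samples) < self.window:
            return "warming_up"  # not enough data for a full window yet
        z = abs(fmean(self.samples) - self.baseline.mean) / self.baseline.std
        return self.policy if z > self.threshold_sigma else "ok"


if __name__ == "__main__":
    monitor = DriftMonitor(Baseline(mean=10.0, std=1.0), window=4,
                           policy="rollback")
    for value in [10.0, 10.0, 10.0, 10.0, 13.0, 13.0, 13.0, 13.0]:
        print(value, monitor.observe(value))
```

Returning the policy name rather than acting directly keeps the detection logic separate from the rollback mechanism, which makes it easier to record each decision in an audit log.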
Deliverables
What the client receives
- MLflow registry with documented promotion gates
- Signed-OTA rollout pipeline (Cosign + S3 + canary policy)
- Custom Jetson agent in Rust with OTA + telemetry
- Fleet dashboard (React + FastAPI + TimescaleDB + Grafana)
- Drift detection + auto-rollback policy with audit log
- Operator runbooks and two on-site operations bootcamps
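The documented promotion gates in the first deliverable can be pictured as a small rule table checked before a model moves from staging to canary to production. The sketch below is illustrative only and makes no MLflow calls. The metric names (`recall`, `precision`, `p95_latency_ms`) and every threshold value are assumptions, not the client's actual gates.

```python
from dataclasses import dataclass

STAGES = ("staging", "canary", "production")


@dataclass(frozen=True)
class Gate:
    """One threshold on one evaluation-set metric."""
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, metrics: dict) -> bool:
        value = metrics.get(self.metric)
        if value is None:
            return False  # a missing metric fails the gate
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical gates per target stage; all numbers are placeholders.
GATES = {
    "canary": [Gate("recall", 0.97), Gate("precision", 0.95)],
    "production": [
        Gate("recall", 0.97),
        Gate("precision", 0.95),
        Gate("p95_latency_ms", 50.0, higher_is_better=False),
    ],
}


def try_promote(current: str, metrics: dict):
    """Return (resulting_stage, failed_metric_names) for one promotion attempt."""
    idx = STAGES.index(current)
    if idx == len(STAGES) - 1:
        raise ValueError("model is already in production")
    target = STAGES[idx + 1]
    failed = [g.metric for g in GATES[target] if not g.passes(metrics)]
    return (current if failed else target), failed


if __name__ == "__main__":
    metrics = {"recall": 0.98, "precision": 0.96}
    print(try_promote("staging", metrics))  # passes the canary gates
    print(try_promote("canary", metrics))   # blocked: latency metric missing
```

Treating a missing metric as a failure means a model cannot be promoted just because an evaluation step was skipped.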
Tools used
Stack and tooling
- MLflow Model Registry
- Cosign for artifact signing + verification
- AWS S3 + CloudFront for signed-OTA artifact delivery
- PostgreSQL + TimescaleDB for fleet telemetry
- React + Foxglove for the operator dashboard
- Prometheus + Grafana for observability
- Custom Jetson agent in Rust for OTA + telemetry
Impact
Business-level effect
- Plant teams now ship model improvements weekly instead of quarterly
- Confidence to deploy more aggressive models because rollback is automated
- ML team in HQ now spends time on model quality, not deployment plumbing
Conclusion
The hardest part of edge AI isn’t the model — it’s operating the fleet after deployment. An MLOps platform that treats edge devices like a managed service compounds over time; without one, every new model is a new operational risk.
Next step
Have an edge AI deployment that grew past 20 devices and is starting to feel unmanageable? Send us your current architecture; we’ll map the path to a managed fleet.
Tagged
- MLOps
- Edge AI
- Jetson
- Model Registry
- Drift Detection
- OTA
Visual results
Key views and intermediate artefacts


Figure: Operator fleet dashboard
Frequently asked questions
Answers from the engagement itself.
When does an edge AI deployment need an MLOps platform?
Above 20–30 devices, manual operations break. Below that, scripts and a shared spreadsheet work. The forcing function is usually drift — once you have enough devices that you can’t manually inspect each one’s accuracy weekly, you need automated detection and rollback.
Do I need MLflow specifically, or can I use SageMaker / Vertex / ClearML?
Use whichever registry your team operates. We default to MLflow for self-hosted setups, AWS SageMaker for AWS-native teams, ClearML for regulated industries, and TrueFoundry for Indian startup-stack teams. The platform-level architecture is the same; the registry is interchangeable.
What does an edge AI MLOps engagement cost in India?
Multi-device fleet platform (50–500 devices, custom dashboard, OTA, drift detection): ₹25–80 lakh, 4–9 months. Above 500 devices the platform engineering scales sub-linearly, so the marginal cost per device drops. Maintenance and platform evolution typically run 15–20% of build cost per year.
Related case studies
Adjacent proof you can read next.

Applied AI · Vision-guided robotics
Vision-guided bin picking at 80 ms end-to-end
How a YOLOv11-Seg + 3D-pose stack on a Jetson Orin Nano replaced fixed-pose jigs in a 6-DOF robotic cell — sub-80 ms latency, 99.2% accuracy, 40% throughput gain.

Edge AI · On-device inspection
Zero-cloud defect detection camera on ESP32-S3
A production conveyor inspection camera running a quantized INT8 CNN entirely on an ESP32-S3 — 18 FPS at 0.4 W, no cloud, 6× lower capex per station.