MLOps platform for a 600-device edge AI fleet, case study.

An industrial client had 600 Jetson-deployed inspection cameras across 14 plants and no operational discipline around them — model deploys were manual, drift went undetected, and a bad model meant 14 plant visits. We built an MLOps platform that operates the fleet like a software service.

By Yantrix Engineering · MLOps + Platform Engineering · Published May 4, 2026 · Updated May 12, 2026 · 3 min read · Multi-plant manufacturing

MLOps fleet dashboard managing 600 Jetson devices for edge AI inspection cameras

Overview

Why this study matters

End-to-end MLOps platform managing 600 Jetson inspection cameras across 14 sites — median model deploy went from 9 days to 38 minutes, with automatic drift-triggered rollback.

Client: A multi-plant Indian manufacturing group operating 14 facilities

Project Type: MLOps + Edge AI Operations

Industry: Multi-plant manufacturing

Service Used: MLOps + Edge AI + Platform Engineering

Results in numbers

What the engagement actually shipped.

38 min: Median model deploy
9 days: Old baseline deploy time
600: Jetson devices managed
14: Sites under one platform
NPS 62: Operator satisfaction

Objectives

What the project needed to achieve

Manage 600+ Jetson devices across 14 sites as one operated fleet
Sign + verify model artifacts so only trusted models reach the field
Cut median time-to-deploy a new model from 9 days to under 1 hour
Detect accuracy drift automatically and roll back without human intervention
Give plant operators a Foxglove-style dashboard rather than a black box

Challenge

Engineering constraint

The client had spent two years deploying 600 inspection cameras across 14 plants, but operating them as a fleet was a manual nightmare. Each model update required physically visiting plants. Accuracy drift on changing SKU mix went undetected until line operators escalated. A bad model that slipped through testing meant a 2-week recall across 14 sites. The team needed an operational platform that treated 600 edge devices as a managed service, not 600 individual deployments.

Approach

How Yantrix approached the work

01. Stood up an MLflow registry as the single source of truth for trained models, with strict promotion gates from staging → canary → production tied to evaluation-set metrics.
02. Built a signed-OTA rollout system: every model artifact is signed at promotion, devices verify the signature before loading, and the rollout pipeline targets canary devices first (5% of fleet) before expanding.
03. Implemented per-device drift detection: each camera logs feature-distribution statistics on a rolling window; the platform flags drift > 2σ from the canary baseline and either alerts or auto-rolls back depending on policy.
04. Built a fleet dashboard (FastAPI + React + Grafana) showing per-device latency, accuracy proxy, last deploy timestamp, drift status, and a one-click rollback action for plant managers.
05. Wrote runbooks and ran two operations bootcamps so the client’s plant IT teams could operate the platform without a Yantrix engineer on call.
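
The drift rule in step 03 can be illustrated with a minimal Python sketch. It is not the platform's code: it assumes drift is tracked on a single scalar feature statistic, that the canary baseline is summarised as a mean and standard deviation, and that the rolling-window mean is what gets compared against the 2σ band. The names DriftMonitor, Baseline and Policy are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from statistics import fmean
from typing import Optional


class Policy(Enum):
    """What to do when drift is flagged (hypothetical policy names)."""
    ALERT = "alert"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Baseline:
    """Summary of the canary cohort's feature statistic (assumed shape)."""
    mean: float
    std: float


class DriftMonitor:
    """Flags drift when the rolling-window mean leaves the sigma band."""

    def __init__(self, baseline: Baseline, window: int = 50,
                 sigmas: float = 2.0, policy: Policy = Policy.ALERT) -> None:
        self.baseline = baseline
        self.sigmas = sigmas
        self.policy = policy
        self._values = deque(maxlen=window)  # oldest samples fall off

    def observe(self, value: float) -> Optional[Policy]:
        """Record one sample; return the policy action if drift is flagged."""
        self._values.append(value)
        if len(self._values) < self._values.maxlen:
            return None  # not enough samples for a full window yet
        deviation = abs(fmean(self._values) - self.baseline.mean)
        if deviation > self.sigmas * self.baseline.std:
            return self.policy
        return None


monitor = DriftMonitor(Baseline(mean=0.50, std=0.05), window=4,
                       policy=Policy.ROLLBACK)
for sample in (0.50, 0.51, 0.49, 0.50):
    monitor.observe(sample)          # in-band samples, nothing flagged
for sample in (0.90, 0.90, 0.90, 0.90):
    action = monitor.observe(sample)
print(action)                        # Policy.ROLLBACK once the window is all 0.90
```

Comparing the window mean rather than individual samples keeps a single noisy frame from triggering a rollback; a production version would likely track several statistics per device and record every decision in the audit log.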

Outcomes

What improved by the end

Median model deploy: 9 days → 38 minutes
5× canary deploys per week with zero unannounced production-wide rollouts
Drift-triggered automatic rollback rate: ~1.2 per month across the fleet (none escalated to operator complaint)
Plant-IT operating cost reduced — no more 14-plant visits per deploy
Operator dashboard satisfaction tracked at NPS 62 across 14 sites

Deliverables

What the client receives

MLflow registry with documented promotion gates
Signed-OTA rollout pipeline (Cosign + S3 + canary policy)
Custom Jetson agent in Rust with OTA + telemetry
Fleet dashboard (React + FastAPI + TimescaleDB + Grafana)
Drift detection + auto-rollback policy with audit log
Operator runbooks and two on-site operations bootcamps
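
One way the canary policy in the rollout pipeline could pick its first 5% of devices is sketched below in Python. The case study does not describe how the cohort is chosen, so the hash-ranking approach, the function name canary_cohort, and the device and rollout identifiers are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List


def canary_cohort(device_ids: Iterable[str], rollout_id: str,
                  percent: int = 5) -> List[str]:
    """Pick a deterministic canary cohort for one rollout.

    Devices are ranked by a hash of (rollout_id, device_id), so the choice
    is reproducible, independent of input order, and rotates between
    rollouts instead of always hitting the same cameras.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in 1..100")
    ranked = sorted(
        device_ids,
        key=lambda d: hashlib.sha256(f"{rollout_id}:{d}".encode()).hexdigest(),
    )
    # Integer ceiling of len * percent / 100, avoiding float rounding.
    size = (len(ranked) * percent + 99) // 100
    return ranked[:size]


fleet = [f"cam-{i:03d}" for i in range(600)]     # hypothetical device IDs
cohort = canary_cohort(fleet, rollout_id="rollout-0001")
print(len(cohort))                               # 30 devices, 5% of 600
```

A real policy would probably also stratify by plant, so that every site has at least one canary device before the rollout expands.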

Tools used

Stack and tooling

MLflow Model Registry
Cosign for artifact signing + verification
AWS S3 + CloudFront for signed-OTA artifact delivery
PostgreSQL + TimescaleDB for fleet telemetry
React + Foxglove for the operator dashboard
Prometheus + Grafana for observability
Custom Jetson agent in Rust for OTA + telemetry
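
The "verify before loading" behaviour of the device agent can be sketched as a fail-closed swap. The actual agent is written in Rust and verifies Cosign signatures; the Python below only illustrates the ordering (verify first, then activate, otherwise keep the current model) and substitutes a plain SHA-256 digest comparison as a stand-in verifier so it runs without external tools. ModelSlot and digest_verifier are hypothetical names.

```python
import hashlib
import hmac
from typing import Callable


class ModelSlot:
    """Holds the model bytes a device is currently serving (hypothetical)."""

    def __init__(self, active: bytes) -> None:
        self.active = active

    def try_swap(self, candidate: bytes,
                 verify: Callable[[bytes], bool]) -> bool:
        """Activate the candidate only if verification passes.

        Fails closed: an unverified artifact never replaces the active model.
        """
        if not verify(candidate):
            return False
        self.active = candidate
        return True


def digest_verifier(expected_sha256: str) -> Callable[[bytes], bool]:
    """Stand-in verifier: compares a SHA-256 digest, NOT a Cosign signature."""
    def verify(blob: bytes) -> bool:
        actual = hashlib.sha256(blob).hexdigest()
        return hmac.compare_digest(actual, expected_sha256)
    return verify


slot = ModelSlot(active=b"model-v1")
good = b"model-v2"
verify = digest_verifier(hashlib.sha256(good).hexdigest())

print(slot.try_swap(b"tampered", verify))  # False, model-v1 stays active
print(slot.try_swap(good, verify))         # True, model-v2 is now active
```

Injecting the verifier as a callable keeps the swap logic independent of the signing tool, so the stand-in can be replaced by a real signature check without touching the activation path.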

Impact

Business-level effect

Plant teams now ship model improvements weekly instead of quarterly
Confidence to deploy more aggressive models because rollback is automated
ML team in HQ now spends time on model quality, not deployment plumbing

Conclusion

The hardest part of edge AI isn’t the model — it’s operating the fleet after deployment. An MLOps platform that treats edge devices like a managed service compounds over time; without one, every new model is a new operational risk.

Next step

Have an edge AI deployment that grew past 20 devices and is starting to feel unmanageable? Send us your current architecture; we’ll map the path to a managed fleet.

Tagged

MLOps
Edge AI
Jetson
Model Registry
Drift Detection
OTA

Visual results

Key views and intermediate artefacts

Signed-OTA rollout pipeline

Operator fleet dashboard

Frequently asked questions

Answers from the engagement itself.

When does an edge AI deployment need an MLOps platform?

Above 20–30 devices, manual operations break. Below that, scripts and a shared spreadsheet work. The forcing function is usually drift — once you have enough devices that you can’t manually inspect each one’s accuracy weekly, you need automated detection and rollback.

Do I need MLflow specifically, or can I use SageMaker / Vertex / ClearML?

Use whichever registry your team already operates. We default to MLflow for self-hosted setups, AWS SageMaker for AWS-native teams, ClearML for regulated industries, and TrueFoundry for Indian startup-stack teams. The platform-level architecture is the same; the registry is interchangeable.
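
To make "the registry is interchangeable" concrete, here is a minimal Python sketch of a promotion gate written against a small interface rather than a specific product. It is an illustration under assumptions, not the engagement's code: the Registry protocol, the in-memory stand-in, the metric name "recall" and the thresholds are all hypothetical, and a real adapter would wrap MLflow, SageMaker or ClearML behind the same two methods.

```python
from typing import Dict, Protocol, Tuple

STAGES = ("staging", "canary", "production")


class Registry(Protocol):
    """The only registry operations the gate logic needs (assumed interface)."""
    def get_stage(self, model: str, version: int) -> str: ...
    def set_stage(self, model: str, version: int, stage: str) -> None: ...


class InMemoryRegistry:
    """Stand-in backend so the sketch runs without any external service."""

    def __init__(self) -> None:
        self._stages: Dict[Tuple[str, int], str] = {}

    def get_stage(self, model: str, version: int) -> str:
        return self._stages.get((model, version), "staging")

    def set_stage(self, model: str, version: int, stage: str) -> None:
        self._stages[(model, version)] = stage


def promote(registry: Registry, model: str, version: int,
            metrics: Dict[str, float],
            gates: Dict[str, Dict[str, float]]) -> str:
    """Advance one stage if every gate metric meets its minimum.

    Returns the stage the model version is in after the attempt.
    """
    current = registry.get_stage(model, version)
    index = STAGES.index(current)
    if index == len(STAGES) - 1:
        return current                      # already in production
    target = STAGES[index + 1]
    required = gates[target]
    # A missing metric counts as a failed gate.
    passed = all(metrics.get(name, float("-inf")) >= minimum
                 for name, minimum in required.items())
    if passed:
        registry.set_stage(model, version, target)
        return target
    return current


registry = InMemoryRegistry()
gates = {"canary": {"recall": 0.95}, "production": {"recall": 0.97}}
print(promote(registry, "inspector", 1, {"recall": 0.96}, gates))  # canary
print(promote(registry, "inspector", 1, {"recall": 0.96}, gates))  # canary (gate not met)
print(promote(registry, "inspector", 1, {"recall": 0.98}, gates))  # production
```

Because promote only depends on get_stage and set_stage, swapping the registry means writing a new adapter class, not changing the gate logic.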

What does an edge AI MLOps engagement cost in India?

Multi-device fleet platform (50–500 devices, custom dashboard, OTA, drift detection): ₹25–80 lakh, 4–9 months. Above 500 devices, the platform engineering scales sub-linearly, so the marginal cost per device drops. Maintenance and platform evolution typically run 15–20% of build cost per year.
