
MLOps platform for a 600-device edge AI fleet, case study.

An industrial client had 600 Jetson-deployed inspection cameras across 14 plants and no operational discipline around them — model deploys were manual, drift went undetected, and a bad model meant 14 plant visits. We built an MLOps platform that operates the fleet like a software service.

By Yantrix Engineering · MLOps + Platform Engineering · 3 min read · Multi-plant manufacturing
Figure: MLOps fleet dashboard managing 600 Jetson devices for edge AI inspection cameras.

Overview

Why this study matters

End-to-end MLOps platform managing 600 Jetson inspection cameras across 14 sites — median model deploy went from 9 days to 38 minutes, with automatic drift-triggered rollback.

Client: A multi-plant Indian manufacturing group operating 14 facilities

Project Type: MLOps + Edge AI Operations

Industry: Multi-plant manufacturing

Service Used: MLOps + Edge AI + Platform Engineering

Results in numbers

What the engagement actually shipped.

  • Median model deploy: 38 min
  • Old baseline deploy time: 9 days
  • Jetson devices managed: 600
  • Sites under one platform: 14
  • Operator satisfaction: NPS 62

Objectives

What the project needed to achieve

  • Manage 600+ Jetson devices across 14 sites as one operated fleet
  • Sign + verify model artifacts so only trusted models reach the field
  • Cut median time-to-deploy a new model from 9 days to under 1 hour
  • Detect accuracy drift automatically and roll back without human intervention
  • Give plant operators a Foxglove-style dashboard rather than a black box

Challenge

Engineering constraint

The client had spent two years deploying 600 inspection cameras across 14 plants, but operating them as a fleet was a manual nightmare. Each model update required physically visiting plants. Accuracy drift on changing SKU mix went undetected until line operators escalated. A bad model that slipped through testing meant a 2-week recall across 14 sites. The team needed an operational platform that treated 600 edge devices as a managed service, not 600 individual deployments.

Approach

How Yantrix approached the work

  1. Stood up an MLflow registry as the single source of truth for trained models, with strict promotion gates from staging → canary → production tied to evaluation-set metrics.
  2. Built a signed-OTA rollout system: every model artifact is signed at promotion, devices verify the signature before loading, and the rollout pipeline targets canary devices first (5% of fleet) before expanding.
  3. Implemented per-device drift detection: each camera logs feature-distribution statistics on a rolling window; the platform flags drift > 2σ from the canary baseline and either alerts or auto-rolls back depending on policy.
  4. Built a fleet dashboard (FastAPI + React + Grafana) showing per-device latency, accuracy proxy, last deploy timestamp, drift status, and a one-click rollback action for plant managers.
  5. Wrote runbooks and ran two operations bootcamps so the client’s plant IT teams could operate the platform without a Yantrix engineer on call.
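
The promotion gates and canary-first rollout in steps 1 and 2 can be illustrated with a minimal Python sketch. It is not the production pipeline: the metric names, the threshold values, and the hash-based cohort selection are assumptions chosen for illustration. Only the staging → canary → production order and the 5% canary share come from the engagement.

```python
import hashlib
from dataclasses import dataclass

STAGES = ["staging", "canary", "production"]


@dataclass(frozen=True)
class Gate:
    metric: str
    minimum: float


# Hypothetical evaluation-set thresholds for entering each stage.
GATES = {
    "canary": [Gate("precision", 0.95), Gate("recall", 0.90)],
    "production": [Gate("precision", 0.97), Gate("recall", 0.93)],
}


def can_promote(current_stage, metrics):
    """Return (next_stage, failed_metrics); next_stage is None at the last stage."""
    idx = STAGES.index(current_stage)
    if idx == len(STAGES) - 1:
        return None, []
    target = STAGES[idx + 1]
    # A missing metric counts as a failure rather than a pass.
    failures = [
        g.metric for g in GATES[target]
        if metrics.get(g.metric, 0.0) < g.minimum
    ]
    return target, failures


def canary_cohort(device_ids, fraction=0.05):
    """Pick a stable canary cohort by ranking devices on a hash of their ID."""
    ranked = sorted(
        device_ids,
        key=lambda d: hashlib.sha256(d.encode()).hexdigest(),
    )
    size = max(1, round(len(ranked) * fraction))
    return ranked[:size]


if __name__ == "__main__":
    fleet = [f"jetson-{i:03d}" for i in range(600)]
    print(len(canary_cohort(fleet)), "canary devices")
    print(can_promote("staging", {"precision": 0.96, "recall": 0.91}))
```

Hashing the device ID keeps the cohort stable between rollouts regardless of the order in which devices are listed, so canary baselines stay comparable. A real policy would likely also spread canaries across plants, which this sketch does not attempt.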

Outcomes

What improved by the end

  • Median model deploy: 9 days → 38 minutes
  • 5× canary deploys per week with zero unannounced production-wide rollouts
  • Drift-triggered automatic rollback rate: ~1.2 per month across the fleet (none escalated to operator complaint)
  • Plant-IT operating cost reduced — no more 14-plant visits per deploy
  • Operator dashboard satisfaction tracked at NPS 62 across 14 sites

Deliverables

What the client receives

  • MLflow registry with documented promotion gates
  • Signed-OTA rollout pipeline (Cosign + S3 + canary policy)
  • Custom Jetson agent in Rust with OTA + telemetry
  • Fleet dashboard (React + FastAPI + TimescaleDB + Grafana)
  • Drift detection + auto-rollback policy with audit log
  • Operator runbooks and two on-site operations bootcamps
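
As an illustration of the drift detection and auto-rollback deliverable, the sketch below shows one way a 2σ check with an alert-or-rollback policy and an audit log might fit together. It assumes drift is measured as the distance between a device's rolling-window mean and the canary baseline mean, in units of the baseline standard deviation. The class and field names are hypothetical, not the platform's actual schema.

```python
import statistics
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Baseline:
    mean: float   # canary-cohort mean of the monitored feature statistic
    stdev: float  # canary-cohort standard deviation of the same statistic


@dataclass
class DriftPolicy:
    threshold_sigma: float = 2.0  # flag drift beyond 2 sigma
    auto_rollback: bool = True    # False means alert only


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, device_id, action, z_score):
        self.entries.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "device_id": device_id,
            "action": action,
            "z_score": round(z_score, 3),
        })


def drift_z_score(window, baseline):
    """Distance of the rolling-window mean from the baseline, in sigmas."""
    if baseline.stdev <= 0:
        raise ValueError("baseline stdev must be positive")
    return abs(statistics.fmean(window) - baseline.mean) / baseline.stdev


def evaluate(device_id, window, baseline, policy, log):
    """Return 'ok', 'alert', or 'rollback' and audit every non-ok decision."""
    z = drift_z_score(window, baseline)
    if z <= policy.threshold_sigma:
        return "ok"
    action = "rollback" if policy.auto_rollback else "alert"
    log.record(device_id, action, z)
    return action


if __name__ == "__main__":
    log = AuditLog()
    base = Baseline(mean=0.50, stdev=0.05)
    print(evaluate("jetson-001", [0.51, 0.49, 0.52], base, DriftPolicy(), log))
    print(evaluate("jetson-002", [0.65, 0.66, 0.64], base, DriftPolicy(), log))
    print(log.entries)
```

Keeping the policy as data rather than code lets a plant run alert-only while it builds trust in the detector, then switch to automatic rollback without a new agent release.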

Tools used

Stack and tooling

  • MLflow Model Registry
  • Cosign for artifact signing + verification
  • AWS S3 + CloudFront for signed-OTA artifact delivery
  • PostgreSQL + TimescaleDB for fleet telemetry
  • React + Foxglove for the operator dashboard
  • Prometheus + Grafana for observability
  • Custom Jetson agent in Rust for OTA + telemetry
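
To make the verify-before-load step concrete, here is a small Python sketch of the gate a device agent applies before swapping models. It does not call Cosign and is not the Rust agent's code: HMAC-SHA256 from the standard library stands in for signature verification so the example stays self-contained, and the function names are hypothetical. A real deployment would verify an asymmetric signature against a public key, so devices never hold signing material.

```python
import hashlib
import hmac


class UntrustedArtifact(Exception):
    """Raised when an artifact fails verification and must not be loaded."""


def sign_artifact(artifact: bytes, key: bytes) -> str:
    # Stand-in for the signing done at promotion time (assumption: HMAC, not Cosign).
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def load_if_trusted(artifact: bytes, signature: str, key: bytes, loader):
    """Verify first, then hand the bytes to the loader; never the reverse."""
    expected = sign_artifact(artifact, key)
    # compare_digest avoids leaking match position through timing.
    if not hmac.compare_digest(expected, signature):
        raise UntrustedArtifact("signature mismatch; keeping current model")
    return loader(artifact)


if __name__ == "__main__":
    key = b"demo-key"
    model = b"model-weights-v4"
    sig = sign_artifact(model, key)
    print(load_if_trusted(model, sig, key, loader=len))
    try:
        load_if_trusted(b"tampered-weights", sig, key, loader=len)
    except UntrustedArtifact as exc:
        print("rejected:", exc)
```

The point of the sketch is the ordering: the loader is only reached after verification succeeds, so a failed check leaves the currently running model untouched.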

Impact

Business-level effect

  • Plant teams now ship model improvements weekly instead of quarterly
  • Confidence to deploy more aggressive models because rollback is automated
  • ML team in HQ now spends time on model quality, not deployment plumbing

Conclusion

The hardest part of edge AI isn’t the model — it’s operating the fleet after deployment. An MLOps platform that treats edge devices like a managed service compounds over time; without one, every new model is a new operational risk.

Next step

Have an edge AI deployment that grew past 20 devices and is starting to feel unmanageable? Send us your current architecture; we’ll map the path to a managed fleet.

Tagged

  • MLOps
  • Edge AI
  • Jetson
  • Model Registry
  • Drift Detection
  • OTA

Frequently asked questions

Answers from the engagement itself.

When does an edge AI deployment need an MLOps platform?

Above 20–30 devices, manual operations break. Below that, scripts and a shared spreadsheet work. The forcing function is usually drift — once you have enough devices that you can’t manually inspect each one’s accuracy weekly, you need automated detection and rollback.

Do I need MLflow specifically, or can I use SageMaker / Vertex / ClearML?

Use whichever registry your team already operates. We default to MLflow for self-hosted setups, AWS SageMaker for AWS-native teams, ClearML for regulated industries, and TrueFoundry for Indian startup-stack teams. The platform-level architecture is the same; the registry is interchangeable.
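
One way to keep the registry interchangeable is to have the rollout code depend on a narrow interface rather than on a specific vendor client. The Python sketch below is an assumption about how that could look: `ModelRegistry`, `version_in_stage`, and `set_stage` are hypothetical names, not the MLflow, SageMaker, or ClearML API, and the in-memory class exists only to make the example runnable.

```python
from typing import Dict, Optional, Protocol, Tuple


class ModelRegistry(Protocol):
    """The only registry operations the rollout code is allowed to use."""

    def version_in_stage(self, name: str, stage: str) -> Optional[str]: ...

    def set_stage(self, name: str, version: str, stage: str) -> None: ...


class InMemoryRegistry:
    """Toy implementation; a real adapter would wrap a registry client."""

    def __init__(self) -> None:
        self._stages: Dict[Tuple[str, str], str] = {}

    def version_in_stage(self, name: str, stage: str) -> Optional[str]:
        return self._stages.get((name, stage))

    def set_stage(self, name: str, version: str, stage: str) -> None:
        self._stages[(name, stage)] = version


def promote(registry: ModelRegistry, name: str, version: str,
            stage: str) -> Optional[str]:
    """Move a version into a stage and return the version it displaced.

    The displaced version is what a later rollback would restore.
    """
    previous = registry.version_in_stage(name, stage)
    registry.set_stage(name, version, stage)
    return previous


if __name__ == "__main__":
    reg = InMemoryRegistry()
    print(promote(reg, "defect-detector", "3", "production"))
    print(promote(reg, "defect-detector", "4", "production"))
```

Swapping registries then means writing one adapter class per backend while the promotion, rollout, and rollback logic stays unchanged.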

What does an edge AI MLOps engagement cost in India?

A multi-device fleet platform (50–500 devices, custom dashboard, OTA, drift detection) costs ₹25–80 lakh and takes 4–9 months. Above 500 devices, platform engineering scales sub-linearly, so the marginal cost per device drops. Maintenance and platform evolution typically run 15–20% of build cost per year.

