Applied AI · Vision-guided robotics

Vision-guided bin picking at 80 ms end-to-end, case study.

A fulfillment client needed reliable bin picking without pose fixtures. We delivered a YOLOv11-Seg + 3D pose stack running on a Jetson Orin Nano, integrated with their existing MoveIt motion planner and hitting sub-80 ms decision latency.

By Yantrix Engineering · Applied AI Studio · 2 min read · Warehouse automation
[Image: Vision-guided bin picking robotic cell with detection overlay]

Overview

Why this study matters

How a YOLOv11-Seg + 3D-pose stack on a Jetson Orin Nano replaced fixed-pose jigs in a 6-DOF robotic cell — sub-80 ms latency, 99.2% accuracy, 40% throughput gain.

Client: A US-based fulfillment automation startup

Project Type: Applied AI + Robotic Manipulation

Industry: Warehouse automation

Service Used: Computer Vision + ROS 2 Integration

Results in numbers

What the engagement actually shipped.

  • 80 ms: End-to-end decision latency
  • 99.2%: Detection accuracy
  • 40%: Cell throughput gain
  • 0: Cloud round-trips per pick

Objectives

What the project needed to achieve

  • Detect and segment random-pose parts inside cluttered bins
  • Estimate 6-DOF pick pose for a parallel-jaw gripper
  • Run the full perception stack on embedded hardware at the cell
  • Integrate with existing MoveIt motion planning with zero PLC changes

Challenge

Engineering constraint

The client was operating a robotic cell that required fixed-pose presentation jigs for every SKU. Throughput was capped by manual part-staging and changeover. They needed vision-based picking that could generalize across SKUs without retooling and run on the edge — no cloud round-trips allowed on the production floor.

Approach

How Yantrix approached the work

  1. Collected and labelled a dataset of the client’s top 40 SKUs inside representative bin clutter, then fine-tuned a YOLOv11-Seg detector with rotation and occlusion augmentation.
  2. Layered a depth-based pose-estimation stage on top of 2D masks using the ZED 2i stereo camera, filtering picks by graspability (approach angle, jaw clearance, surface normal).
  3. Quantized the detector to FP16 and exported through TensorRT targeting Jetson Orin Nano; benchmarked camera-to-command latency under realistic lighting.
  4. Exposed the perception stack as a ROS 2 action server so the existing MoveIt planner could request picks without any downstream refactor.
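
The graspability filter in step 2 can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the `GraspCandidate` fields, the function names, and every threshold (45° maximum tilt, 85 mm jaw opening, 5 mm clearance per side) are hypothetical, and the sketch assumes a top-down approach along the vertical axis.

```python
import math
from dataclasses import dataclass


@dataclass
class GraspCandidate:
    # All field names are hypothetical, chosen for this sketch only.
    normal: tuple          # surface normal at the grasp point (x, y, z)
    part_width_m: float    # part width across the jaw closing axis
    free_space_m: float    # free space beside the part, per side
    score: float           # detector confidence for the mask


def tilt_deg(normal, approach=(0.0, 0.0, 1.0)):
    """Angle in degrees between the surface normal and the approach axis."""
    dot = sum(n * a for n, a in zip(normal, approach))
    norm = math.sqrt(sum(n * n for n in normal)) * math.sqrt(sum(a * a for a in approach))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def is_graspable(c, max_tilt_deg=45.0, max_jaw_open_m=0.085, min_clearance_m=0.005):
    """Apply the three checks named above; thresholds are assumed values."""
    if tilt_deg(c.normal) > max_tilt_deg:      # approach angle / surface normal
        return False
    if c.part_width_m > max_jaw_open_m:        # part must fit between the jaws
        return False
    if c.free_space_m < min_clearance_m:       # jaws need room beside the part
        return False
    return True


def rank_picks(candidates, **limits):
    """Keep graspable candidates, best detector score first."""
    ok = [c for c in candidates if is_graspable(c, **limits)]
    return sorted(ok, key=lambda c: c.score, reverse=True)
```

Filtering before ranking means a high-confidence detection that the gripper cannot physically reach never outranks a lower-confidence part that it can.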

Outcomes

What improved by the end

  • Sub-80 ms end-to-end decision latency (capture → model → grasp command)
  • 99.2% detection accuracy across the labelled SKU set
  • False-pick rate reduced to 1.4 per 1,000 attempts under production lighting
  • Eliminated the need for per-SKU staging jigs — changeover now data-only
  • Fully edge-deployed — zero production cloud dependencies

Deliverables

What the client receives

  • Trained and quantized vision model with reproducible training pipeline
  • ROS 2 perception package and MoveIt integration
  • Camera, lens, and lighting specification for the cell
  • Benchmark report: accuracy per class, latency distribution, failure modes
  • Retraining playbook so the client can extend to new SKUs themselves

Tools used

Stack and tooling

  • Ultralytics YOLOv11-Seg
  • PyTorch + TensorRT (FP16)
  • ROS 2 Humble + MoveIt 2
  • ZED 2i stereo camera
  • NVIDIA Jetson Orin Nano 8GB
  • Roboflow for dataset ops

Impact

Business-level effect

  • Cell throughput up by ~40% vs. fixed-pose baseline
  • Operator labor reallocated away from part-staging
  • Extensibility to new SKUs without mechanical changes
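
For readers who think in cycle time rather than throughput, the ~40% figure can be converted with a short calculation. The sketch below assumes throughput is simply the inverse of average per-pick cycle time (staging and changeover included), which is a simplification introduced here and not a claim from the engagement's benchmark report.

```python
def cycle_time_reduction(throughput_gain):
    """Fractional cycle-time reduction implied by a fractional throughput gain.

    Assumes throughput = 1 / average cycle time, so a gain g scales
    cycle time by 1 / (1 + g).
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput gain must be greater than -100%")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


print(f"{cycle_time_reduction(0.40):.1%}")  # prints 28.6%
```

Under that assumption, a 40% throughput gain corresponds to roughly a 29% shorter average cycle.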

Conclusion

The stack shows what becomes possible when vision, control, and hardware are designed as one system rather than handed off between vendors. It’s a playbook we reuse for any vision-guided manipulation project.

Next step

Have a robotic cell bottlenecked by manual staging, fixed jigs, or cloud-dependent vision? Let’s talk about bringing the perception on-device.

Tagged

  • YOLOv11
  • Jetson Orin
  • ROS 2
  • MoveIt
  • Bin Picking
  • Edge AI

Frequently asked questions

Answers from the engagement itself.

What latency can you hit on a Jetson Orin Nano for bin picking?

Sub-80 ms end-to-end — camera capture through inference, NMS, pose estimation, and grasp-command emission. The YOLOv11-Seg forward pass itself is around 22 ms in FP16; the rest of the budget goes to preprocessing, stereo depth, and graspability filtering.
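
The budget arithmetic above (80 ms total, about 22 ms for the forward pass, leaving about 58 ms for everything else) can be tracked with a small stage timer. This is a sketch under stated assumptions: `StageTimer`, the helper names, and the nearest-rank percentile choice are illustrative and not taken from the engagement's benchmark tooling.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

BUDGET_MS = 80.0      # end-to-end target quoted in this case study
INFERENCE_MS = 22.0   # approximate FP16 forward pass quoted above


class StageTimer:
    """Collects wall-clock samples per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples_ms[name].append(elapsed_ms)


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def remaining_budget_ms(budget_ms=BUDGET_MS, inference_ms=INFERENCE_MS):
    """Time left for preprocessing, stereo depth, and graspability filtering."""
    return budget_ms - inference_ms


def fraction_over_budget(end_to_end_ms, budget_ms=BUDGET_MS):
    """Share of measured cycles that exceeded the end-to-end budget."""
    return sum(1 for ms in end_to_end_ms if ms > budget_ms) / len(end_to_end_ms)
```

Wrapping the whole capture-to-command cycle in a single `with timer.stage("cycle"):` block, alongside per-stage blocks, gives a true end-to-end distribution; summing per-stage percentiles would overstate the tail.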

Can vision-guided bin picking replace fixed-pose jigs in production?

Yes, when the pipeline is engineered end-to-end. We routinely retire per-SKU jigs by combining a fine-tuned segmentation model with depth-based pose estimation and graspability filtering. The harder problem is usually lighting and camera placement, not the model.
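
Depth-based pose estimation starts by lifting mask pixels into 3D. The sketch below uses the standard pinhole camera model; it is not the ZED SDK API. The function names are hypothetical, the intrinsics (`fx`, `fy`, `cx`, `cy`) would come from camera calibration, and the three-point normal is a simplified stand-in for fitting a plane to many points.

```python
import math


def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into a camera-frame 3D point."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def centroid(points):
    """Mean of a non-empty list of 3D points, usable as a grasp position."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def normal_from_three(p0, p1, p2):
    """Unit normal of the plane through three points, oriented toward the camera."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    n = (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        raise ValueError("points are collinear; no unique plane")
    n = tuple(c / length for c in n)
    # The camera looks along +Z, so a surface facing it has a negative z normal.
    if n[2] > 0.0:
        n = tuple(-c for c in n)
    return n
```

The centroid gives a candidate grasp position and the normal gives an approach direction; together they cover five of the six degrees of freedom, and the rotation about the normal would typically be chosen from the mask's orientation.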

Do you ship the retraining pipeline so the client can add SKUs later?

Always. Every deployment includes a documented retraining playbook — dataset format, label conventions, augmentation pipeline, and the export-to-TensorRT script. New SKUs onboard with data, not code.
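
Onboarding SKUs with data only works if new labels are checked before they reach training. As an illustration, and assuming the dataset uses the Ultralytics segmentation label layout (one object per line: a class index followed by normalized polygon coordinates), a validator might look like the sketch below. The function name and the 40-class count in the usage line are assumptions for this example; the playbook's actual label conventions are not reproduced here.

```python
def parse_seg_label(line, num_classes):
    """Parse one label line: '<class> x1 y1 x2 y2 ... xn yn', coordinates in [0, 1].

    Returns (class_id, [(x, y), ...]); raises ValueError on malformed input.
    """
    parts = line.split()
    if len(parts) < 7:
        raise ValueError("need a class index and at least 3 polygon points")
    if (len(parts) - 1) % 2 != 0:
        raise ValueError("polygon has an odd number of coordinates")

    class_id = int(parts[0])  # int() raises ValueError for non-integers
    if not 0 <= class_id < num_classes:
        raise ValueError(f"class index {class_id} outside 0..{num_classes - 1}")

    coords = [float(p) for p in parts[1:]]
    if any(not 0.0 <= c <= 1.0 for c in coords):
        raise ValueError("coordinates must be normalized to [0, 1]")

    # Pair up alternating x and y values into polygon vertices.
    polygon = list(zip(coords[0::2], coords[1::2]))
    return class_id, polygon


class_id, polygon = parse_seg_label("3 0.1 0.2 0.5 0.2 0.3 0.8", num_classes=40)
```

Running a check like this over every label file before retraining catches out-of-range class indices and unnormalized coordinates early, when they are cheap to fix.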

