Applied AI · Vision-guided robotics

Case study: vision-guided bin picking at sub-80 ms end-to-end latency

A fulfilment client needed reliable bin picking without fixed-pose fixtures. We delivered a YOLOv11-Seg and 3D pose-estimation stack that runs on a Jetson Orin Nano, integrates with the client's existing MoveIt motion planner, and reaches sub-80 ms decision latency from image capture to grasp command.

Vision-guided bin picking robotic cell with detection overlay

Overview

Why this study matters

Yantrix built a production vision stack that lets a 6-DOF arm pick randomly oriented SKUs out of a cluttered bin, with all perception running on an edge device at the cell.

Project Type: Applied AI + Robotic Manipulation

Industry: Warehouse automation

Service Used: Computer Vision + ROS 2 Integration

Objective

What the project needed to achieve

Detect and segment random-pose parts inside cluttered bins
Estimate 6-DOF pick pose for a parallel-jaw gripper
Run the full perception stack on embedded hardware at the cell
Integrate with the existing MoveIt motion planner without any PLC changes

Challenge

Engineering constraint

The client was operating a robotic cell that required fixed-pose presentation jigs for every SKU. Throughput was capped by manual part-staging and changeover. They needed vision-based picking that could generalize across SKUs without retooling and run on the edge — no cloud round-trips allowed on the production floor.

Deliverables

What the client receives

Trained vision model, converted to FP16 for TensorRT, with a reproducible training pipeline
ROS 2 perception package and MoveIt integration
Camera, lens, and lighting specification for the cell
Benchmark report: accuracy per class, latency distribution, failure modes
Retraining playbook so the client can extend to new SKUs themselves

Visual results

Key perception and deployment views

Figure: Detection + mask overlay (YOLOv11-Seg detection and segmentation overlay on bin contents)

Figure: ROS 2 integration

Figure: Jetson Orin deployment (Jetson Orin Nano running the perception stack at the cell)

Approach

How Yantrix approached the work

Collected and labelled a dataset of the client's top 40 SKUs inside representative bin clutter, then fine-tuned a YOLOv11-Seg detector with rotation and occlusion augmentation.
Layered a depth-based pose-estimation stage on top of 2D masks using the ZED 2i stereo camera, filtering picks by graspability (approach angle, jaw clearance, surface normal).
Converted the detector to FP16 precision and built a TensorRT engine targeting the Jetson Orin Nano, then benchmarked camera-to-command latency under realistic lighting.
Exposed the perception stack as a ROS 2 action server so the existing MoveIt planner could request picks without any downstream refactor.
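
The depth-based pose stage and graspability filter described above can be illustrated with a minimal sketch. It is not the production implementation: it assumes a depth map in metres aligned to the segmentation mask, a pinhole camera model with hypothetical intrinsics (fx, fy, cx, cy), and it swaps in a plain PCA fit for the surface frame. The 30 degree tilt limit is an illustrative value, and the jaw-clearance check is left out.

```python
import numpy as np


def backproject_mask(depth_m, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D camera-frame points (pinhole model)."""
    v, u = np.nonzero(mask)                  # pixel rows (v) and columns (u) inside the mask
    z = depth_m[v, u]
    valid = np.isfinite(z) & (z > 0)         # drop holes in the depth map
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.column_stack([x, y, z])


def estimate_surface_frame(points):
    """Centroid plus PCA axes; the least-variance axis approximates the surface normal."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    _, eigvecs = np.linalg.eigh(cov)         # eigenvalues come back in ascending order
    normal = eigvecs[:, 0]                   # least variance: surface normal
    major = eigvecs[:, 2]                    # most variance: long axis of the part
    if normal[2] > 0:                        # flip so the normal faces the camera (-Z)
        normal = -normal
    return centroid, normal, major


def is_graspable(normal, max_tilt_deg=30.0):
    """Accept a pick only if the unit normal is within max_tilt_deg of the camera axis."""
    approach = np.array([0.0, 0.0, -1.0])    # direction from the part towards the camera
    cos_tilt = float(np.clip(normal @ approach, -1.0, 1.0))
    return bool(np.degrees(np.arccos(cos_tilt)) <= max_tilt_deg)


# Synthetic example: a flat part 0.5 m from the camera, 40 px wide and 20 px tall.
depth = np.full((100, 100), 0.5)
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 30:70] = True

points = backproject_mask(depth, mask, fx=500.0, fy=500.0, cx=50.0, cy=50.0)
centroid, normal, major = estimate_surface_frame(points)
print("pick point:", centroid, "normal:", normal, "graspable:", is_graspable(normal))
```

In a real cell the centroid and axes would still need to be transformed from the camera frame into the robot base frame before being sent to the planner.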

Outcome

What improved by the end

Sub-80 ms end-to-end decision latency (capture -> model -> grasp command)
99.2% detection accuracy across the labelled SKU set
False-pick rate reduced to 1.4 per 1,000 attempts under production lighting
Eliminated the need for per-SKU staging jigs; changeover now requires only new data and retraining, with no mechanical changes
Fully edge-deployed — zero production cloud dependencies
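
As a rough illustration of how the latency and false-pick figures above could be computed from cell logs, the sketch below summarizes per-cycle latency samples against the 80 ms budget and converts a false-pick count into a rate per 1,000 attempts. The function names and all sample values are hypothetical and are not the client's measurements.

```python
import numpy as np


def summarize_latency(samples_ms, budget_ms=80.0):
    """Return percentile statistics and the share of cycles under the latency budget."""
    samples = np.asarray(samples_ms, dtype=float)
    return {
        "p50_ms": float(np.percentile(samples, 50)),
        "p95_ms": float(np.percentile(samples, 95)),
        "p99_ms": float(np.percentile(samples, 99)),
        "max_ms": float(samples.max()),
        "within_budget": float((samples < budget_ms).mean()),
    }


def false_pick_rate_per_1000(false_picks, attempts):
    """Convert a raw false-pick count into a rate per 1,000 pick attempts."""
    return 1000.0 * false_picks / attempts


# Hypothetical capture-to-command timings in milliseconds, for illustration only.
stats = summarize_latency([60, 70, 75, 79, 85])
print(stats)

# Illustrative arithmetic: 7 false picks in 5,000 attempts is 1.4 per 1,000.
print(false_pick_rate_per_1000(7, 5000))
```

Reporting tail percentiles alongside the median matters for a cell like this, because a single slow cycle can stall the arm even when the typical cycle is well under budget.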

Tools used

Ultralytics YOLOv11-Seg
PyTorch + TensorRT (FP16)
ROS 2 Humble + MoveIt 2
ZED 2i stereo camera
NVIDIA Jetson Orin Nano 8GB
Roboflow for dataset ops

Impact

Cell throughput up by ~40% vs. fixed-pose baseline
Operator labor reallocated away from part-staging
Extensibility to new SKUs without mechanical changes

Conclusion

The stack shows what becomes possible when vision, control, and hardware are designed as one system rather than handed across vendors. It's a playbook we re-use for any vision-guided manipulation project.

Next step

Have a robotic cell bottlenecked by manual staging, fixed jigs, or cloud-dependent vision? Let's talk about bringing the perception on-device.
