
Case study: vision-guided bin picking with sub-80 ms end-to-end latency.
A fulfilment client needed reliable bin picking without pose fixtures. We delivered a YOLOv11-Seg + 3D pose stack running on a Jetson Orin Nano, fully integrated with their MoveIt motion planner, hitting sub-80 ms decision latency.

Overview
Why this study matters
Yantrix built a production vision stack that lets a 6-DOF arm pick randomly oriented SKUs out of a cluttered bin — running entirely on an edge device.
Project Type: Applied AI + Robotic Manipulation
Industry: Warehouse automation
Service Used: Computer Vision + ROS 2 Integration
Objective
What the project needed to achieve
- Detect and segment random-pose parts inside cluttered bins
- Estimate 6-DOF pick pose for a parallel-jaw gripper
- Run the full perception stack on embedded hardware at the cell
- Integrate with existing MoveIt motion planning with zero PLC changes
Challenge
Engineering constraint
The client was operating a robotic cell that required fixed-pose presentation jigs for every SKU. Throughput was capped by manual part-staging and changeover. They needed vision-based picking that could generalize across SKUs without retooling and run on the edge — no cloud round-trips allowed on the production floor.
Deliverables
What the client receives
- Trained and quantized vision model with reproducible training pipeline
- ROS 2 perception package and MoveIt integration
- Camera, lens, and lighting specification for the cell
- Benchmark report: accuracy per class, latency distribution, failure modes
- Retraining playbook so the client can extend to new SKUs themselves
Visual results
Key simulation and design views
- ROS 2 integration
- Jetson Orin deployment
Approach
How Yantrix approached the work
- Collected and labelled a dataset of the client's top 40 SKUs inside representative bin clutter, then fine-tuned a YOLOv11-Seg detector with rotation and occlusion augmentation.
- Layered a depth-based pose-estimation stage on top of 2D masks using the ZED 2i stereo camera, filtering picks by graspability (approach angle, jaw clearance, surface normal).
- Quantized the detector to FP16 and exported through TensorRT targeting Jetson Orin Nano; benchmarked camera-to-command latency under realistic lighting.
- Exposed the perception stack as a ROS 2 action server so the existing MoveIt planner could request picks without any downstream refactor.
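The pose-estimation and graspability step can be sketched in a few lines of NumPy: lift the masked depth pixels into 3D with the pinhole model, fit a plane normal to the patch, and reject picks whose surface tilts too far for the gripper. The intrinsics and the 30° threshold below are illustrative stand-ins; the production stage uses the ZED 2i's calibrated intrinsics and the client's actual jaw geometry.

```python
import numpy as np

def backproject(mask, depth, fx, fy, cx, cy):
    """Lift masked depth pixels into 3D camera-frame points (pinhole model)."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def surface_normal(points):
    """Least-squares plane normal via SVD of the centered point cloud."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n = vt[-1]                    # smallest singular vector = plane normal
    return n if n[2] < 0 else -n  # orient the normal toward the camera

def graspable(points, max_tilt_deg=30.0):
    """Reject picks whose surface tilts too far from the optical axis
    (a stand-in for the full approach-angle / jaw-clearance checks)."""
    n = surface_normal(points)
    tilt = np.degrees(np.arccos(abs(n[2])))  # angle from the optical axis
    return tilt <= max_tilt_deg

# Toy example: a flat, level patch of depth seen through a 3x3 mask.
mask = np.ones((3, 3), dtype=bool)
depth = np.full((3, 3), 0.5)  # metres
pts = backproject(mask, depth, fx=500.0, fy=500.0, cx=1.0, cy=1.0)
print(graspable(pts))  # a level surface passes the tilt check
```

In production the same filter also scores jaw clearance against neighbouring segmentation masks, so a geometrically pickable part can still be rejected when the gripper would collide with bin clutter.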
Outcome
What improved by the end
- Sub-80 ms end-to-end decision latency (capture -> model -> grasp command)
- 99.2% detection accuracy across the labelled SKU set
- False-pick rate reduced to 1.4 per 1,000 attempts under production lighting
- Eliminated the need for per-SKU staging jigs — changeover now data-only
- Fully edge-deployed — zero production cloud dependencies
Tools used
- Ultralytics YOLOv11-Seg
- PyTorch + TensorRT (FP16)
- ROS 2 Humble + MoveIt 2
- ZED 2i stereo camera
- NVIDIA Jetson Orin Nano 8GB
- Roboflow for dataset ops
Impact
- Cell throughput up by ~40% vs. fixed-pose baseline
- Operator labor reallocated away from part-staging
- Extensibility to new SKUs without mechanical changes
Conclusion
The stack shows what becomes possible when vision, control, and hardware are designed as one system rather than handed across vendors. It's a playbook we re-use for any vision-guided manipulation project.
Next step
Have a robotic cell bottlenecked by manual staging, fixed jigs, or cloud-dependent vision? Let's talk about bringing the perception on-device.
Tell us about your project. We'll respond within 1-2 business days with a preliminary scope and timeline — no boilerplate, no up-sell.