AgenticLab: A Real-World Robot Agent Platform
that Can See, Think, and Act

Pengyuan Guo1,* Zhonghao Mai1,* Zhengtong Xu1,* Kaidi Zhang1 Heng Zhang2 Zichen Miao1
Arash Ajoudani2 Zachary Kingston1 Qiang Qiu1 Yu She1
1Purdue University 2Instituto Italiano di Tecnologia *Equal contribution

The full hardware–software stack of AgenticLab will be released.

Key Contributions

Closed-loop Vision-Language Robot Agent

We introduce a robot agent pipeline for open-world, open-vocabulary manipulation with vision-language closed-loop reasoning.

Model-Agnostic VLM Interface

Different VLMs (e.g., Gemini, GPT, Qwen) can be seamlessly swapped through a unified interface, enabling controlled evaluation without model-specific engineering.

Real-World Embodied Benchmark

Our benchmark measures grounded perception, spatial reasoning, and long-horizon decision-making under closed-loop execution on physical robots.

Open-Source, Deployable Platform

We release a reproducible hardware–software stack that integrates sensing, control, and agentic reasoning, enabling direct deployment of VLM-based agents in the wild.
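The model-agnostic VLM interface above can be sketched as a small abstraction layer. This is a hypothetical illustration, not the released AgenticLab API: the class and function names (`VLMBackend`, `VLMResponse`, `run_agent_step`) are placeholders showing how provider-specific backends could be normalized behind one `query()` signature so the agent pipeline contains no model-specific code.

```python
# Hypothetical sketch of a model-agnostic VLM interface (names are
# illustrative, not from the AgenticLab release): each backend adapts
# one provider API to a shared query() signature, so swapping Gemini,
# GPT, or Qwen is a one-line change at construction time.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VLMResponse:
    text: str           # raw model output
    latency_s: float    # wall-clock latency, for benchmarking
    tokens_used: int    # token count, for cost comparison


class VLMBackend(ABC):
    @abstractmethod
    def query(self, prompt: str, images: list[bytes]) -> VLMResponse:
        """Send a multimodal prompt and return a normalized response."""


class EchoBackend(VLMBackend):
    """Stand-in backend for testing the interface without API access."""

    def query(self, prompt: str, images: list[bytes]) -> VLMResponse:
        return VLMResponse(text=f"echo: {prompt}", latency_s=0.0,
                           tokens_used=len(prompt.split()))


def run_agent_step(backend: VLMBackend, prompt: str) -> str:
    # The pipeline depends only on the abstract interface, never on a
    # concrete provider, which is what enables controlled evaluation.
    return backend.query(prompt, images=[]).text
```

Because every backend returns the same `VLMResponse` shape, latency and token-usage comparisons across models fall out of the interface for free.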

Real-World Robot Manipulation

We design a suite of five real-world robot manipulation tasks to evaluate embodied vision-language reasoning under challenging conditions.

The open-vocabulary prompts used for each task are displayed below.

Sorting

The agent categorizes objects into designated bins based on semantic attributes, demonstrating open-vocabulary grounding and compositional reasoning.

In-the-wild Kitchen

Move the food to the bowl.

Lab Scene

Sort the toys to the blue bin.

In-the-wild Kitchen

Sort the toys to the blue bin.

In-the-wild Outdoor

Sort the toys to the blue bin.

Stacking

The agent vertically arranges objects in a specified order, requiring precise placement and sequential planning under strong inter-step dependencies.

Lab Scene

Stack the cubes on the pink plate from bottom to top: orange, yellow, green and blue.

In-the-wild Kitchen

Stack the cubes on the pink plate from bottom to top: orange, blue, yellow, and green.

In-the-wild Kitchen

Stack the cubes on the orange plate from bottom to top: orange, blue, yellow, and green.

In-the-wild Outdoor

Stack the cubes on the blue plate from bottom to top: orange, blue, green, and yellow.

Crossword

The agent arranges letter blocks to form intersecting words on a grid, combining world knowledge with fine-grained spatial placement.

In-the-wild Kitchen

Crossword Solution 1

Lab Scene

Crossword Solution 2

In-the-wild Kitchen

Crossword Solution 3

In-the-wild Kitchen

Crossword Solution 4

Fill the numbered slots using the provided blocks to solve the crossword puzzle. You do not need to use all blocks or all slots.

Reorientation

The agent adjusts object poses to satisfy language-specified orientation constraints, emphasizing spatial understanding beyond 2D placement.

Lab Scene

Pick up the bottles and place them on the plates.

Lab Scene

Pick up the bottles and place them on the plates.

In-the-wild Kitchen

Pick up the bottle and place it on the plate.

In-the-wild Outdoor

Pick up the bottle and place it on the plate.

Kitchen

A long-horizon rearrangement task in which the agent places items into context-appropriate containers, requiring sustained grounding and error recovery.

In-the-wild Lobby

Open the pot, put the potato into the bowl, then take out the cup in the top drawer, place it on the plate, and close the drawer.

In-the-wild Lobby

Close the pot, put the spice bottle into the top drawer, and close the drawer.

In-the-wild Outdoor

Put the potato into the pot, close the pot, then put the salt bottle in the top drawer and close the drawer.

In-the-wild Outdoor

Close the pot, put the salt bottle into the top drawer, and close the drawer.

Method

AgenticLab executes a closed-loop agentic reasoning pipeline for manipulation that integrates task parsing, grounding, planning, execution, verification, and replanning. The system operates through iterative perception and action cycles, enabling robust performance in unstructured environments.

AgenticLab Pipeline


The pipeline uses multi-view RGB-D observations for open-vocabulary grounding, VLM-based reasoning for task decomposition and verification, and primitive-based execution with closed-loop feedback. This modular design allows different VLMs to be swapped in through a unified interface, enabling fair evaluation without model-specific engineering.
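The iterative perceive–reason–act cycle described above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the released implementation: `perceive`, `plan`, `execute`, and `verify` are hypothetical callables standing in for the grounding, VLM-reasoning, primitive-execution, and verification modules.

```python
# Minimal sketch of the closed-loop cycle (function names are
# illustrative placeholders, not the released AgenticLab API):
# verify the goal, replan from fresh observations, execute one
# primitive, and repeat until success or a step budget runs out.
def closed_loop_episode(task, perceive, plan, execute, verify,
                        max_steps=10):
    for _ in range(max_steps):
        obs = perceive()           # multi-view RGB-D observation
        if verify(task, obs):      # VLM-based goal verification
            return True
        subgoal = plan(task, obs)  # task decomposition / replanning
        execute(subgoal)           # primitive-based execution
    return False                   # budget exhausted without success
```

Because planning always restarts from the latest observation, a failed or disturbed action is corrected on the next cycle rather than derailing the whole episode.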

VLMs as Robot Agent Benchmarks

1. Can a Single VLM Drive a Robot Agent?

Single-VLM Pipeline Comparison

We compare single-VLM baselines under the same agent pipeline to isolate model effects.

Failure Mode Breakdown

Failure mode breakdown for single-VLM pipelines on the sorting task.

2. Module Benchmark

Modular Benchmark Radar Chart

Sub-module Performance Comparison

Normalized per-module performance across the pipeline (higher is better). Each axis corresponds to a module score.

Latency Comparison Across VLMs

Token Usage Comparison Across VLMs

3. Compositional Pipeline vs. Single VLM

Compositional vs Single VLM

We compare a Gemini Flash single-VLM baseline with our compositional pipeline across five manipulation tasks. Performance is measured by a task progress score capturing partial completion.
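A partial-completion score of the kind described above can be sketched as follows. The exact scoring rubric is an assumption; this illustrative version credits the fraction of ordered subgoals completed before the first failure.

```python
# Hedged sketch of a task progress score (the actual rubric used in
# the paper may differ): credit the fraction of ordered subgoals
# completed before the first failure, so partial progress on a
# long-horizon task still earns a nonzero score.
def progress_score(subgoal_results: list[bool]) -> float:
    done = 0
    for ok in subgoal_results:
        if not ok:
            break
        done += 1
    return done / len(subgoal_results) if subgoal_results else 0.0
```

For example, completing two of four stacking steps before a misplacement yields 0.5 rather than the 0.0 a binary success metric would report.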

4. Off-the-Shelf VLA vs. AgenticLab

Performance Comparison

VLA vs AgenticLab Performance

We compare the AgenticLab pipeline with an off-the-shelf π0.5 VLA fine-tuned on 40 demonstrations for sorting and 30 for stacking, evaluated on both tasks.

Ablation Study

We analyze two critical components that enable reliable closed-loop robot manipulation in AgenticLab.

Action Checker for Vision-Language Closed-loop Reasoning

We compare no checker, a goal checker only, and the full action checker to evaluate robustness against disturbances.

Grasp Planner for Active Perception

We evaluate how grasp pose verification using local point clouds improves the semantic correctness and physical feasibility of selected grasps.

BibTeX


@article{guo2026agenticlab,
  title={AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act},
  author={Guo, Pengyuan and Mai, Zhonghao and Xu, Zhengtong and Zhang, Kaidi and Zhang, Heng and Miao, Zichen and Ajoudani, Arash and Kingston, Zachary and Qiu, Qiang and She, Yu},
  journal={arXiv preprint arXiv:2602.01662},
  year={2026}
}