Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs
Authors: Zidan Wang, Rui Shen, Bradly C. Stadie
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions can be summarized as follows: Empirical Validation through Extensive Experiments and Ablation Studies: We validated our framework through extensive tabletop experiments, including long-horizon pick-and-place, sweeping, and fine-grained trajectory planning and generation, demonstrating its effectiveness beyond standard pick-and-place tasks. |
| Researcher Affiliation | Academia | Zidan Wang EMAIL Department of Statistics, Northwestern University Rui Shen EMAIL Department of Computer Science, University of Virginia Bradly Stadie EMAIL Department of Statistics, Northwestern University |
| Pseudocode | Yes | The high-level planning phase involves iterative refinement between the Supervisor and Verification agents, as outlined in Algorithm 1. ... Details are shown in Algorithm 2. |
| Open Source Code | Yes | Demonstration videos of the robotic policies in action, along with the code and prompts, can be accessed on our project website. ... For reproducibility and further experimentation, the full prompts are available in our codebase. |
| Open Datasets | Yes | To assess our approach's ability to understand multimodal prompts, reason about abstract concepts, and follow constraints, we tested it on all 17 tasks from VIMABench (Jiang et al., 2023). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits for reproducibility. It describes evaluation protocols like "Each task was executed in 10 runs, allowing only a single attempt per run." and "Each experiment was conducted over 10 trials with a maximum of 300 steps per trial." for its zero-shot setup, rather than dataset partitioning. |
| Hardware Specification | No | The paper mentions specific robotic hardware like "UFactory xArm 7", "Franka Emika Panda robot", and "Intel RealSense D435 camera" used for real-world experiments. However, it does not specify the computational hardware (e.g., CPU, GPU models, memory) used to run the VLLMs or perform inference/training for the described methods. |
| Software Dependencies | No | The paper mentions "controlled via the xArm Controller using Python and ROS" but does not specify version numbers for Python, ROS, or any other software libraries or solvers used. |
| Experiment Setup | Yes | All experiments were conducted with consistency and rigor to accurately assess our framework's performance. Multimodal Reasoning & Constraint Manipulation: Each task was executed in 10 runs, allowing only a single attempt per run. ... Ambiguous Instruction & Contextual Reasoning: Each task was performed in 2 runs for each of the 4 variations with varying difficulty. ... Spatial Planning & Execution: Each task was carried out in 5 runs under a closed-loop evaluation protocol, permitting up to three replanning attempts. ... Each experiment was conducted over 10 trials with a maximum of 300 steps per trial. |