Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs

Authors: Zidan Wang, Rui Shen, Bradly C. Stadie

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Our contributions can be summarized as follows: Empirical Validation through Extensive Experiments and Ablation Studies: We validated our framework through extensive tabletop experiments, including long-horizon pick-and-place, sweeping, and fine-grained trajectory planning and generation, demonstrating its effectiveness beyond standard pick-and-place tasks."

Researcher Affiliation | Academia | "Zidan Wang (EMAIL), Department of Statistics, Northwestern University; Rui Shen (EMAIL), Department of Computer Science, University of Virginia; Bradly Stadie (EMAIL), Department of Statistics, Northwestern University"

Pseudocode | Yes | "The high-level planning phase involves iterative refinement between the Supervisor and Verification agents, as outlined in Algorithm 1. ... Details are shown in Algorithm 2."

Open Source Code | Yes | "Demonstration videos of the robotic policies in action, along with the code and prompts, can be accessed on our project website. ... For reproducibility and further experimentation, the full prompts are available in our codebase."

Open Datasets | Yes | "To assess our approach's ability to understand multimodal prompts, reason about abstract concepts, and follow constraints, we tested it on all 17 tasks from VIMABench (Jiang et al., 2023)."

Dataset Splits | No | The paper does not provide training/test/validation dataset splits. It describes evaluation protocols for its zero-shot setup ("Each task was executed in 10 runs, allowing only a single attempt per run." and "Each experiment was conducted over 10 trials with a maximum of 300 steps per trial.") rather than dataset partitioning.

Hardware Specification | No | The paper names the robotic hardware used in real-world experiments ("UFactory xArm 7", "Franka Emika Panda robot", and "Intel RealSense D435 camera") but does not specify the computational hardware (e.g., CPU or GPU models, memory) used to run the VLLMs or perform inference.

Software Dependencies | No | The paper states the robot was "controlled via the xArm Controller using Python and ROS" but does not give version numbers for Python, ROS, or any other software libraries or solvers.

Experiment Setup | Yes | "All experiments were conducted with consistency and rigor to accurately assess our framework's performance. Multimodal Reasoning & Constraint Manipulation: Each task was executed in 10 runs, allowing only a single attempt per run. ... Ambiguous Instruction & Contextual Reasoning: Each task was performed in 2 runs for each of the 4 variations with varying difficulty. ... Spatial Planning & Execution: Each task was carried out in 5 runs under a closed-loop evaluation protocol, permitting up to three replanning attempts. ... Each experiment was conducted over 10 trials with a maximum of 300 steps per trial."