Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs

Authors: Zidan Wang, Rui Shen, Bradly C. Stadie

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Our contributions can be summarized as follows: Empirical Validation through Extensive Experiments and Ablation Studies: We validated our framework through extensive tabletop experiments, including long-horizon pick-and-place, sweeping, and fine-grained trajectory planning and generation, demonstrating its effectiveness beyond standard pick-and-place tasks."

Researcher Affiliation | Academia | "Zidan Wang (EMAIL), Department of Statistics, Northwestern University; Rui Shen (EMAIL), Department of Computer Science, University of Virginia; Bradly Stadie (EMAIL), Department of Statistics, Northwestern University"

Pseudocode | Yes | "The high-level planning phase involves iterative refinement between the Supervisor and Verification agents, as outlined in Algorithm 1. ... Details are shown in Algorithm 2."

Open Source Code | Yes | "Demonstration videos of the robotic policies in action, along with the code and prompts, can be accessed on our project website. ... For reproducibility and further experimentation, the full prompts are available in our codebase."

Open Datasets | Yes | "To assess our approach's ability to understand multimodal prompts, reason about abstract concepts, and follow constraints, we tested it on all 17 tasks from VIMABench (Jiang et al., 2023)."

Dataset Splits | No | The paper does not provide training/test/validation dataset splits. It describes evaluation protocols for its zero-shot setup ("Each task was executed in 10 runs, allowing only a single attempt per run." and "Each experiment was conducted over 10 trials with a maximum of 300 steps per trial.") rather than dataset partitioning.

Hardware Specification | No | The paper names the robotic hardware used in real-world experiments ("UFactory xArm 7", "Franka Emika Panda robot", and "Intel RealSense D435 camera") but does not specify the computational hardware (e.g., CPU or GPU models, memory) used to run the VLLMs or perform inference.

Software Dependencies | No | The paper states the robot was "controlled via the xArm Controller using Python and ROS" but does not give version numbers for Python, ROS, or any other software libraries or solvers.

Experiment Setup | Yes | "All experiments were conducted with consistency and rigor to accurately assess our framework's performance. Multimodal Reasoning & Constraint Manipulation: Each task was executed in 10 runs, allowing only a single attempt per run. ... Ambiguous Instruction & Contextual Reasoning: Each task was performed in 2 runs for each of the 4 variations with varying difficulty. ... Spatial Planning & Execution: Each task was carried out in 5 runs under a closed-loop evaluation protocol, permitting up to three replanning attempts. ... Each experiment was conducted over 10 trials with a maximum of 300 steps per trial."