Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Video Language Planning

Authors: Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, Jonathan Tompson

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
Researcher Affiliation | Collaboration | Google DeepMind, Massachusetts Institute of Technology, UC Berkeley
Pseudocode | Yes | Algorithm 1 Decision Making with VLP
Open Source Code | Yes | https://video-language-planning.github.io/
Open Datasets | Yes | We trained VLP on approximately 10000 long horizon trajectories in both simulation and real across a set of several hundred different long horizon goals. [...] Bridge (Ebert et al., 2021), RT-2 (Brohan et al., 2023), Ego4D (Grauman et al., 2022), EPIC-KITCHEN (Damen et al., 2018), and LAION-400M (Schuhmann et al., 2022).
Dataset Splits | No | The paper mentions using several datasets but does not provide specific train/validation/test split percentages, sample counts, or a detailed methodology for splitting the authors' own gathered data; it describes only the evaluation setup.
Hardware Specification | Yes | We use a base channel width of 256 across models and train a base text-conditioned video model using 64 TPUv3 pods for 3 days and higher resolution superresolution models for 1 day.
Software Dependencies | No | The paper mentions specific models (e.g., PaLM-E, LAVA) and general software categories, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | To generate video plans, we planned with a horizon of 16, a beam width of 2, a language branching factor of 4, and a video branching factor of 4. To enable fast video generation, we used the DDIM sampler, with a total of 64 timesteps of sampling at the base resolution and 4 timesteps of sampling at the higher resolution samples, with a classifier-free guidance scale of 5 for the base model. We queried the VLM policy to generate different text actions given an image with a temperature of 0.3. Our VLM heuristic function decoded the number of steps left until task-completion with a temperature of 0.0. We set our heuristic function clipping threshold during planning to be 50 and removed videos if the improvement was larger than 50 after one video rollout.
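
The planning hyperparameters quoted in the Experiment Setup entry can be read as the configuration of a beam search over (text action, video rollout) branches. The sketch below is one possible reading, not the authors' implementation and not a transcription of Algorithm 1. The names PlanConfig, plan_with_beam_search, and the toy_* helpers are hypothetical, and so are two interpretations: that "horizon of 16" counts planning steps, and that the clipping threshold caps the heuristic when ranking while the "improvement larger than 50" rule is checked on the unclipped value. The three model callables stand in for the VLM policy, the video model, and the VLM heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanConfig:
    # Values quoted in the Experiment Setup entry; field names are hypothetical.
    horizon: int = 16              # assumed to count planning steps
    beam_width: int = 2
    language_branch: int = 4       # text actions sampled per beam
    video_branch: int = 4          # video rollouts sampled per text action
    policy_temperature: float = 0.3
    clip_threshold: float = 50.0


def plan_with_beam_search(state, propose_actions, rollout_video, steps_left,
                          cfg=PlanConfig()):
    """Generic beam search; all three callables are stand-ins for learned models."""
    # Each beam holds (current state, actions so far, unclipped steps-left estimate).
    beams = [(state, [], steps_left(state))]
    for _ in range(cfg.horizon):
        candidates = []
        for cur, actions, prev_raw in beams:
            for action in propose_actions(cur, cfg.language_branch,
                                          cfg.policy_temperature):
                for _ in range(cfg.video_branch):
                    nxt = rollout_video(cur, action)
                    raw = steps_left(nxt)
                    # Assumed reading: drop rollouts whose one-step improvement
                    # exceeds the threshold, treating them as implausible.
                    if prev_raw - raw > cfg.clip_threshold:
                        continue
                    candidates.append((nxt, actions + [action], raw))
        if not candidates:
            break
        # Rank by the clipped heuristic (fewer steps left is better).
        candidates.sort(key=lambda c: min(c[2], cfg.clip_threshold))
        beams = candidates[: cfg.beam_width]
        if beams[0][2] <= 0:
            break
    _, best_actions, best_raw = beams[0]
    return best_actions, best_raw


# Toy stand-ins: the "state" is just the number of steps remaining.
def toy_propose(state, n, temperature):
    return [f"move {k}" for k in range(1, n + 1)]


def toy_rollout(state, action):
    return max(state - int(action.split()[1]), 0)


def toy_steps_left(state):
    return float(state)


actions, remaining = plan_with_beam_search(5, toy_propose, toy_rollout, toy_steps_left)
print(actions, remaining)
```

With the quoted settings, each planning step in this sketch evaluates at most 2 × 4 × 4 = 32 rollouts before pruning back to the 2 best beams. The DDIM step counts and guidance scale would belong to the video-model stand-in and are left out here.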