Video Language Planning
Authors: Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, Jonathan Tompson
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms). |
| Researcher Affiliation | Collaboration | Google DeepMind, Massachusetts Institute of Technology, UC Berkeley |
| Pseudocode | Yes | Algorithm 1 Decision Making with VLP (sketched in code below the table) |
| Open Source Code | Yes | https://video-language-planning.github.io/ |
| Open Datasets | Yes | We trained VLP on approximately 10000 long horizon trajectories in both simulation and real across a set of several hundred different long horizon goals. [...] Bridge (Ebert et al., 2021), RT-2 (Brohan et al., 2023), Ego4D (Grauman et al., 2022), EPIC-KITCHEN (Damen et al., 2018), and LAION-400M (Schuhmann et al., 2022). |
| Dataset Splits | No | The paper mentions using several datasets but does not provide specific train/validation/test split percentages, sample counts, or a detailed methodology for splitting its own gathered data; splits are described only for evaluation. |
| Hardware Specification | Yes | We use a base channel width of 256 across models and train a base text-conditioned video model using 64 TPUv3 pods for 3 days and higher resolution superresolution models for 1 day. |
| Software Dependencies | No | The paper mentions specific models (e.g., PaLM-E, LAVA) and general software categories, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | To generate video plans, we planned with a horizon of 16, a beam width of 2, a language branching factor of 4, and a video branching factor of 4. To enable fast video generation, we used the DDIM sampler, with a total of 64 timesteps of sampling at the base resolution and 4 timesteps of sampling at the higher resolution samples, with a classifier-free guidance scale of 5 for the base model. We queried the VLM policy to generate different text actions given an image with a temperature of 0.3. Our VLM heuristic function decoded the number of steps left until task-completion with a temperature of 0.0. We set our heuristic function clipping threshold during planning to be 50 and removed videos if the improvement was larger than 50 after one video rollout. (These values are collected in the config sketch below the table.) |
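
The Pseudocode row cites Algorithm 1 (Decision Making with VLP): a tree search that interleaves a VLM policy proposing text actions, a text-to-video model rolling out each action, and a VLM heuristic scoring the resulting frames. A minimal Python sketch of that loop is below; `vlm_policy`, `video_model`, `vlm_heuristic`, and the `search_depth` argument are hypothetical stand-ins, not the paper's API.

```python
# Hedged sketch of Algorithm 1 (Decision Making with VLP): beam search over
# language actions and video rollouts. All callables are hypothetical stand-ins.

def vlp_plan(image, goal, vlm_policy, video_model, vlm_heuristic,
             search_depth, beam_width, lang_branch, video_branch):
    # Each beam entry is (video_plan_so_far, last_frame).
    beams = [([], image)]
    for _ in range(search_depth):
        candidates = []
        for plan, frame in beams:
            # VLM policy proposes `lang_branch` candidate text actions for the current frame.
            for action in vlm_policy(frame, goal, n=lang_branch):
                # Text-to-video model samples `video_branch` short rollouts per action.
                for video in video_model(frame, action, n=video_branch):
                    candidates.append((plan + [video], video[-1]))
        # VLM heuristic estimates steps remaining to completion; keep the best branches.
        candidates.sort(key=lambda c: vlm_heuristic(c[1], goal))
        beams = candidates[:beam_width]
    return beams[0][0]  # best full video plan


# Example call with the branching values quoted above (search_depth and task string illustrative):
# plan = vlp_plan(obs, "make a line of blocks", vlm_policy, video_model,
#                 vlm_heuristic, search_depth=3, beam_width=2,
#                 lang_branch=4, video_branch=4)
```

In the paper, the returned video plan is then converted to robot actions with a goal-conditioned policy and the procedure is repeated in closed loop.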
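
For quick reference, the concrete numbers quoted in the Experiment Setup row can be gathered into one configuration block; the grouping and field names are ours, not from the paper's code.

```python
# Hypothetical config collecting the evaluation settings quoted above.
vlp_eval_config = {
    "planning": {
        "horizon": 16,                      # video planning horizon
        "beam_width": 2,
        "language_branching_factor": 4,
        "video_branching_factor": 4,
        "heuristic_clip_threshold": 50,     # also used to prune rollouts whose improvement exceeds 50
    },
    "video_sampling": {
        "sampler": "DDIM",
        "base_resolution_steps": 64,
        "super_resolution_steps": 4,
        "classifier_free_guidance": 5.0,    # base model only
    },
    "vlm": {
        "policy_temperature": 0.3,          # sampling candidate text actions
        "heuristic_temperature": 0.0,       # decoding steps left until completion
    },
}
```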