Position: Video as the New Language for Real-World Decision Making

Authors: Sherry Yang, Jacob C Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To further illustrate how video generation can have a profound impact on real-world applications, we provide an in-depth analysis of recent work that utilizes video generation as task solvers, answers to questions, policies/agents, and environment simulators through techniques such as instruction tuning, in-context learning, planning, and reinforcement learning in settings such as games, robotics, self-driving, and science. Details of the models used to generate the examples can be found in Appendix A. Additional generated videos can be found in Appendix B.
Researcher Affiliation | Collaboration | Google DeepMind, UC Berkeley, MIT.
Pseudocode | No | The paper describes model architectures and processes in narrative text and references existing models, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for this work, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | The paper uses the contractor data from Baker et al. (2022), trains a video generation model on the Open X-Embodiment dataset (Padalkar et al., 2023), and uses STEM data collected from Schwarzer et al. (2023).
Dataset Splits | No | The paper uses various datasets for its examples but does not specify training, validation, and test splits (e.g., percentages, sample counts, or explicit standard splits with citations) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU or CPU models or memory amounts, used to run its experiments or generate its examples.
Software Dependencies | No | The paper references various models and architectures but does not list software dependencies with version numbers (e.g., PyTorch 1.9, CUDA 11.1) needed to replicate the experiments.
Experiment Setup | Yes | The lower-resolution video generation model operates at resolution [24, 40], followed by two spatial super-resolution models with target resolutions [48, 80] and [192, 320]. Classifier-free guidance (Ho & Salimans, 2022) was applied for text or action conditioning. ... Our MaskGIT implementation uses 8 steps with a cosine masking schedule.
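
The experiment-setup quote above names three components: a resolution cascade, classifier-free guidance, and MaskGIT decoding. The sketches below illustrate each one under stated assumptions; none of them is the authors' implementation. First, a minimal sketch of the cascaded pipeline, where `base_model` and `sr_models` are hypothetical callables:

```python
# Hypothetical cascaded video generation: a low-resolution base model
# followed by two spatial super-resolution stages, per the quoted setup.
def cascaded_generate(base_model, sr_models, cond):
    # Base stage samples a video at 24x40 (height, width).
    video = base_model(cond, resolution=(24, 40))
    # Each super-resolution stage upsamples to its target resolution,
    # conditioned on the lower-resolution video and the same `cond`.
    for sr_model, resolution in zip(sr_models, [(48, 80), (192, 320)]):
        video = sr_model(video, cond, resolution=resolution)
    return video
```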
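
Next, classifier-free guidance as cited (Ho & Salimans, 2022). This is a generic sketch: the `denoise` callable, the guidance weight `w`, and the use of `None` for the null conditioning are assumptions, not details from the paper.

```python
def classifier_free_guidance(denoise, x_t, t, cond, w=2.0):
    """Combine conditional and unconditional denoiser outputs.

    `denoise(x_t, t, cond)` is a hypothetical model call returning the
    predicted noise for noisy video `x_t` at timestep `t`; cond=None
    stands in for the null conditioning dropped at training time. The
    guidance weight `w` trades sample diversity for adherence to the
    text or action conditioning.
    """
    eps_cond = denoise(x_t, t, cond)    # conditioned prediction
    eps_uncond = denoise(x_t, t, None)  # unconditioned prediction
    # Ho & Salimans (2022): eps = (1 + w) * eps_cond - w * eps_uncond
    return (1.0 + w) * eps_cond - w * eps_uncond
```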
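
Finally, the 8-step MaskGIT-style decoding with a cosine masking schedule. Only the step count and the schedule shape come from the quote; the token predictor, confidence scores, and MASK sentinel are assumptions for illustration.

```python
import math
import numpy as np

def cosine_mask_ratio(step, total_steps=8):
    # Fraction of tokens still masked after this step; the cosine
    # schedule cos(pi/2 * t) falls from ~1 to 0 as t goes 0 -> 1.
    return math.cos(0.5 * math.pi * (step + 1) / total_steps)

def maskgit_decode(predict_tokens, num_tokens, total_steps=8):
    """Iteratively fill in masked tokens over `total_steps` steps.

    `predict_tokens(tokens, mask)` is a hypothetical model call that
    returns (sampled_tokens, confidences) for every position; -1 marks
    masked slots in `tokens`.
    """
    MASK = -1
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(total_steps):
        mask = tokens == MASK
        sampled, conf = predict_tokens(tokens, mask)
        tokens[mask] = sampled[mask]
        if step < total_steps - 1:
            # Re-mask the least confident predictions so the scheduled
            # fraction of positions stays masked for the next step.
            num_remask = int(num_tokens * cosine_mask_ratio(step, total_steps))
            conf = np.where(mask, conf, np.inf)  # never re-mask fixed tokens
            tokens[np.argsort(conf)[:num_remask]] = MASK
    return tokens
```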