Position: Video as the New Language for Real-World Decision Making
Authors: Sherry Yang, Jacob C Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To further illustrate how video generation can have a profound impact on real-world applications, we provide an in-depth analysis of recent work that utilizes video generation as task solvers, answers to questions, policies/agents, and environment simulators through techniques such as instruction tuning, in-context learning, planning, and reinforcement learning in settings such as games, robotics, self-driving, and science. Details of the models used to generate the examples can be found in Appendix A. Additional generated videos can be found in Appendix B. |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2UC Berkeley, 3MIT. |
| Pseudocode | No | The paper describes model architectures and processes in narrative text and references existing models but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their code for the work described in this paper, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We used the contractor data from Baker et al. (2022) ... training a video generation model on the Open X-Embodiment dataset (Padalkar et al., 2023) ... STEM data collected from Schwarzer et al. (2023). |
| Dataset Splits | No | The paper uses various datasets for its discussed examples but does not specify exact training, validation, and test splits (e.g., percentages, sample counts, or explicit standard splits with citations) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory amounts, or detailed computer specifications used for running its experiments or generating examples. |
| Software Dependencies | No | The paper references various models and architectures but does not list specific software dependencies with their version numbers (e.g., PyTorch 1.9, CUDA 11.1) needed to replicate the experiment. |
| Experiment Setup | Yes | The lower resolution video generation model operates at resolution [24, 40], followed by two spatial super-resolution models with target resolution [48, 80] and [192, 320]. Classifier-free guidance (Ho & Salimans, 2022) was applied for text or action conditioning. ... Our MaskGIT implementation uses 8 steps with a cosine masking schedule. |