Learning Universal Policies via Text-Guided Video Generation

Authors: Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, Pieter Abbeel

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The focus of these experiments is to evaluate UniPi in terms of its ability to enable effective, generalizable decision making. In particular, we evaluate (1) the ability to combinatorially generalize across different subgoals in Section 4.1, (2) the ability to effectively learn and generalize across many tasks in Section 4.2, (3) the ability to leverage existing videos on the internet to generalize to complex tasks in Section 4.3. See experimental details in Appendix A. Additional results are given in Appendix B and videos in the supplement.
Researcher Affiliation | Collaboration | Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, Pieter Abbeel; affiliations: MIT, Google DeepMind, UC Berkeley, Georgia Tech, University of Alberta. https://universal-policy.github.io/ Correspondence to yilundu@mit.edu and sherryy@berkeley.edu.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link to "video visualizations" (https://universal-policy.github.io/) but does not explicitly state that the source code for the methodology described in the paper is available at that link or elsewhere.
Open Datasets | Yes | Our training data consists of an internet-scale pretraining dataset and a smaller real-world robotic dataset. The pretraining dataset uses the same data as [19], which consists of 14 million video-text pairs, 60 million image-text pairs, and the publicly available LAION-400M image-text dataset. The robotic dataset is adopted from the Bridge dataset [29] with 7.2k video-text pairs, where we use the task IDs as texts.
Dataset Splits | No | We partition the 7.2k video-text pairs into train (80%) and test (20%) splits. (This specifies a train/test ratio but does not mention a validation split, nor does it provide full split details for all datasets used, which limits reproducibility; an illustrative split sketch follows the table.)
Hardware Specification | Yes | We use 256 TPU-v4 chips for our first-frame conditioned generation model and temporal super resolution model.
Software Dependencies | No | We use T5-XXL [22] to process input prompts which consists of 4.6 billion parameters. [...] The inverse dynamics model is trained using the Adam optimizer with gradient norm clipped at 1 and learning rate 1e-4 for a total of 2M steps where linear warmup is applied to the first 10k steps. (Specific models such as T5-XXL and the Adam optimizer are named, but no version numbers are given for general software dependencies such as Python, PyTorch/TensorFlow, or other libraries; a prompt-encoding sketch follows the table.)
Experiment Setup | Yes | We train each of our video diffusion models for 2M steps using batch size 2048 with learning rate 1e-4 and 10k linear warmup steps. [...] The inverse dynamics model is trained using the Adam optimizer with gradient norm clipped at 1 and learning rate 1e-4 for a total of 2M steps where linear warmup is applied to the first 10k steps. (An optimizer and warmup sketch follows the table.)
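
To make the Dataset Splits entry concrete: the paper describes an 80/20 train/test partition of the 7.2k Bridge video-text pairs but releases no splitting code, so the following is only a minimal sketch. The shuffling seed, helper name, and placeholder pair format are assumptions, and no validation split is created because the paper mentions none.

import random

def split_video_text_pairs(pairs, train_frac=0.8, seed=0):
    """Partition (video, text) pairs into train/test splits.

    The 80/20 ratio follows the paper; the seed and helper name are
    illustrative assumptions, not released code.
    """
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    n_train = int(train_frac * len(pairs))
    train = [pairs[i] for i in indices[:n_train]]
    test = [pairs[i] for i in indices[n_train:]]
    return train, test

# Example with 7.2k placeholder pairs (video path, task-ID text).
pairs = [(f"episode_{i}.mp4", f"task_{i % 50}") for i in range(7200)]
train_pairs, test_pairs = split_video_text_pairs(pairs)
print(len(train_pairs), len(test_pairs))  # 5760 1440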
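
The Software Dependencies entry names T5-XXL as the prompt encoder but gives no library versions. Below is a minimal sketch of frozen prompt encoding using the Hugging Face transformers library; the checkpoint name, the maximum prompt length, and the use of transformers/PyTorch are assumptions (the paper trained on TPUs and does not state its software stack).

import torch
from transformers import T5EncoderModel, T5Tokenizer

# Checkpoint name is an assumption; the paper only says "T5-XXL" (4.6 billion parameters).
MODEL_NAME = "google/t5-v1_1-xxl"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
encoder = T5EncoderModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Return frozen T5 token embeddings for conditioning a text-to-video model."""
    tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                       max_length=77, truncation=True)  # max_length is illustrative
    return encoder(**tokens).last_hidden_state  # shape (1, 77, d_model)

text_embeddings = encode_prompt("pick up the red block and place it in the bowl")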
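
The Experiment Setup entry quotes Adam with learning rate 1e-4, 10k linear-warmup steps, and gradient-norm clipping at 1 for the inverse dynamics model. The following PyTorch-style sketch wires those hyperparameters together; the model, batch format, and loss are placeholders, and PyTorch itself is an assumption since the paper names no framework.

import torch

def make_optimizer(model, lr=1e-4, warmup_steps=10_000):
    """Adam with linear warmup over the first 10k steps, per the quoted setup."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Ramp the learning rate linearly from ~0 to `lr` during warmup, then hold it.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler

def train_step(model, batch, optimizer, scheduler, max_grad_norm=1.0):
    """One update with the gradient norm clipped at 1, as stated in the paper."""
    loss = model(batch)  # placeholder: assumes the model returns a scalar loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
    return loss.item()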