Learning Universal Policies via Text-Guided Video Generation
Authors: Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, Pieter Abbeel
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The focus of these experiments is to evaluate UniPi in terms of its ability to enable effective, generalizable decision making. In particular, we evaluate (1) the ability to combinatorially generalize across different subgoals in Section 4.1, (2) the ability to effectively learn and generalize across many tasks in Section 4.2, (3) the ability to leverage existing videos on the internet to generalize to complex tasks in Section 4.3. See experimental details in Appendix A. Additional results are given in Appendix B and videos in the supplement. |
| Researcher Affiliation | Collaboration | Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, Pieter Abbeel. Affiliations: MIT, Google DeepMind, UC Berkeley, Georgia Tech, University of Alberta. https://universal-policy.github.io/ Correspondence to yilundu@mit.edu and sherryy@berkeley.edu. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link for "video visualizations" (https://universal-policy.github.io/) but does not explicitly state that the source code for the methodology described in the paper is available at this link or elsewhere. |
| Open Datasets | Yes | Our training data consists of an internet-scale pretraining dataset and a smaller real-world robotic dataset. The pretraining dataset uses the same data as [19], which consists of 14 million video-text pairs, 60 million image-text pairs, and the publicly available LAION-400M image-text dataset. The robotic dataset is adopted from the Bridge dataset [29] with 7.2k video-text pairs, where we use the task IDs as texts. |
| Dataset Splits | No | We partition the 7.2k video-text pairs into train (80%) and test (20%) splits. (This specifies a train/test split but mentions no validation split and does not give full split details for all datasets used, which limits reproducibility. A minimal split sketch, based only on the stated 80/20 ratio, follows the table.) |
| Hardware Specification | Yes | We use 256 TPU-v4 chips for our first-frame conditioned generation model and temporal super resolution model. |
| Software Dependencies | No | We use T5-XXL [22] to process input prompts which consists of 4.6 billion parameters. [...] The inverse dynamics model is trained using the Adam optimizer with gradient norm clipped at 1 and learning rate 1e-4 for a total of 2M steps where linear warmup is applied to the first 10k steps. (While specific components such as T5-XXL and the Adam optimizer are named, no version numbers are given for general software dependencies such as Python, PyTorch/TensorFlow, or other libraries. A hedged sketch of the described training configuration follows the table.) |
| Experiment Setup | Yes | We train each of our video diffusion models for 2M steps using batch size 2048 with learning rate 1e-4 and 10k linear warmup steps. [...] The inverse dynamics model is trained using the Adam optimizer with gradient norm clipped at 1 and learning rate 1e-4 for a total of 2M steps where linear warmup is applied to the first 10k steps. |
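
As a reading aid for the Dataset Splits row above, the sketch below shows one way the stated 80/20 train/test partition of the 7.2k Bridge video-text pairs could be reproduced. The paper does not release code, and the random seed, shuffling, and split granularity (per pair rather than per task) are assumptions introduced here.

```python
# Minimal sketch of an 80/20 train/test split of the 7.2k Bridge video-text pairs.
# Assumptions not stated in the paper: random shuffling, a fixed seed, and a
# per-pair split; no validation split is described.
import random

def split_pairs(pairs, train_frac=0.8, seed=0):
    """Return (train, test) lists with roughly train_frac of the pairs in train."""
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    cut = int(train_frac * len(indices))
    train = [pairs[i] for i in indices[:cut]]
    test = [pairs[i] for i in indices[cut:]]
    return train, test
```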
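
Similarly, the Software Dependencies and Experiment Setup rows quote the optimizer schedule (Adam, learning rate 1e-4, 10k-step linear warmup, gradient norm clipped at 1, 2M total steps). The sketch below expresses that schedule with optax; the choice of optax/JAX is an assumption, since the paper does not name its training framework.

```python
# Hedged sketch of the reported training schedule: Adam at a peak learning rate
# of 1e-4, linear warmup over the first 10k steps, gradients clipped to global
# norm 1, for 2M total steps. The optax/JAX framework choice is an assumption.
import optax

TOTAL_STEPS = 2_000_000   # reported total training steps
WARMUP_STEPS = 10_000     # reported linear warmup steps
PEAK_LR = 1e-4            # reported learning rate

# Linear warmup from 0 to the peak learning rate; held constant afterwards.
learning_rate = optax.linear_schedule(
    init_value=0.0,
    end_value=PEAK_LR,
    transition_steps=WARMUP_STEPS,
)

# Clip gradients by global norm before the Adam update, as reported.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adam(learning_rate=learning_rate),
)
```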