iVideoGPT: Interactive VideoGPTs are Scalable World Models
Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate iVideoGPT in three different control-relevant settings and compare its performance with prior state-of-the-art methods. We demonstrate that iVideoGPT is versatile enough to provide competitive performance across a range of tasks (Sections 4.1, 4.2, and 4.3) and conduct in-depth analysis to understand the tokenization and prediction ability, data efficiency, model scaling, and computational efficiency (Section 4.4). Experimental details can be found in Appendix A. |
| Researcher Affiliation | Collaboration | Jialong Wu1, Shaofeng Yin1,2, Ningya Feng1, Xu He3, Dong Li3, Jianye Hao3,4, Mingsheng Long1 (1School of Software, BNRist, Tsinghua University; 2Zhili College, Tsinghua University; 3Huawei Noah's Ark Lab; 4College of Intelligence and Computing, Tianjin University) |
| Pseudocode | Yes | Algorithm 1: Model-Based Policy Optimization (MBPO), adapted from [42]. (A hedged sketch of this loop follows the table.) |
| Open Source Code | Yes | Code and pre-trained models are available at https://thuml.github.io/iVideoGPT. |
| Open Datasets | Yes | We leverage a mixture of 35 datasets from the Open X-Embodiment (OXE) dataset [70] and the Something-Something v2 (SSv2) dataset [25], totaling 1.4 million trajectories (see Appendix A.2 for details). |
| Dataset Splits | Yes | We select 1% of samples from each subset as validation data and use the rest for training. For SSv2, we manually select 95 classes with clear motion trends from the original 174 video classes as our pre-training data with a weighting of 15%. We use the official splits of SSv2 for training and validation. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | Our models are trained and evaluated on an A100 and RTX 3090 GPU cluster. Each experiment utilizes 4 GPUs in parallel, with 16 data loader workers per device. GPU days required for training are reported in Table 2. Experiments at 64×64 resolution can be conducted with 24 GB of GPU memory per device, while 256×256 resolution requires 40 GB. |
| Software Dependencies | No | The paper mentions using "PyTorch, using the diffusers and transformers libraries" but does not specify version numbers for these software components. |
| Experiment Setup | Yes | The main hyperparameters of our experiment are detailed in Tables 2, 3, and 5. Table 3: Hyperparameters of i Video GPT training and evaluation. Table 5: Hyperparameters of model-based RL with i Video GPT. |
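The Pseudocode row cites Algorithm 1, an adaptation of Model-Based Policy Optimization (MBPO) [42]. Below is a minimal sketch of the generic MBPO loop for orientation only; the object names (`world_model`, `policy`, `real_buffer`, `model_buffer`) and their method signatures are hypothetical and do not come from the paper or the released code.

```python
# Minimal sketch of a generic MBPO-style loop (not the paper's implementation).
# All objects are duck-typed placeholders: env follows a gym-like API,
# world_model predicts one-step transitions, policy is e.g. an actor-critic.

def mbpo_loop(env, world_model, policy, real_buffer, model_buffer,
              num_iterations=100, rollout_horizon=5, steps_per_iter=1000):
    for _ in range(num_iterations):
        # 1. Collect real experience with the current policy.
        obs = env.reset()
        for _ in range(steps_per_iter):
            action = policy.act(obs)
            next_obs, reward, done, _ = env.step(action)
            real_buffer.add(obs, action, reward, next_obs, done)
            obs = env.reset() if done else next_obs

        # 2. Fit (or fine-tune) the world model on real data.
        world_model.train_on(real_buffer)

        # 3. Generate short imagined rollouts branched from real states.
        for start_obs in real_buffer.sample_states(batch_size=256):
            obs = start_obs
            for _ in range(rollout_horizon):
                action = policy.act(obs)
                next_obs, reward, done = world_model.step(obs, action)
                model_buffer.add(obs, action, reward, next_obs, done)
                if done:
                    break
                obs = next_obs

        # 4. Update the policy on the imagined (model-generated) data.
        policy.update(model_buffer)
```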
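The Dataset Splits row states that 1% of samples from each OXE subset are held out as validation. A minimal sketch of such a per-subset split follows, assuming each subset is a list of trajectory records; the function and data layout are assumptions for illustration, not taken from the released code.

```python
import random

def split_subsets(subsets, val_fraction=0.01, seed=0):
    """Hold out val_fraction of trajectories from each subset as validation.

    `subsets` maps a subset name to a list of trajectory records; the actual
    data layout in the iVideoGPT pipeline is not specified in the paper.
    """
    rng = random.Random(seed)
    train, val = {}, {}
    for name, trajectories in subsets.items():
        indices = list(range(len(trajectories)))
        rng.shuffle(indices)
        n_val = max(1, int(len(indices) * val_fraction))
        val[name] = [trajectories[i] for i in indices[:n_val]]
        train[name] = [trajectories[i] for i in indices[n_val:]]
    return train, val
```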