Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Authors: Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate Diffusion Forcing's merits as a generative sequence model across diverse applications in video and time series prediction, planning, and imitation learning. |
| Researcher Affiliation | Academia | Boyuan Chen, MIT CSAIL, boyuanc@mit.edu; Diego Martí Monsó, Technical University of Munich, diego.marti@tum.de; Yilun Du, MIT CSAIL, yilundu@mit.edu; Max Simchowitz, MIT CSAIL, msimchow@mit.edu; Russ Tedrake, MIT CSAIL, russt@mit.edu; Vincent Sitzmann, MIT CSAIL, sitzmann@mit.edu |
| Pseudocode | Yes | Algorithm 1 Diffusion Forcing Training; Algorithm 2 DF Sampling with Guidance |
| Open Source Code | Yes | Project website: https://boyuan.space/diffusion-forcing/; NeurIPS Paper Checklist: Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code has been released publicly. |
| Open Datasets | Yes | We train a convolutional RNN implementation of Causal Diffusion Forcing for video generative modeling on videos of Minecraft gameplay [68] and DMLab navigation [68].; We evaluate our proposed decision-making framework in a standard offline RL benchmark, D4RL [18].; We access the datasets from Gluon TS [2], and set the context and prediction windows to the same length for each dataset. |
| Dataset Splits | Yes | We construct a validation set of the same cardinality as the held-out test set as a randomly sampled subset of subsequences from the training set.; The validation set is a random subset of the training set with the same number of sequences as the test set. |
| Hardware Specification | Yes | Time series, maze planning, compositionality, and visual imitation experiments can be trained with a single RTX 2080 Ti with 11 GB of memory. We use 8 A100 GPUs for both video prediction datasets. |
| Software Dependencies | No | The paper mentions software like 'pytorch-ts' and 'Gluon TS [2]' but does not provide specific version numbers for these or other key software components or libraries. |
| Experiment Setup | Yes | We choose the number of channels in z to be 16 for DMLab and 32 for Minecraft.; We use sigmoid noise schedule [9] for video prediction, linear noise schedule for maze planning, and cosine schedule for everything else.; We train for 50K steps with a batch size of 8 16. |
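The core mechanism behind the pseudocode cited above (Algorithm 1, Diffusion Forcing Training) is that each token in a sequence is corrupted with an independently sampled noise level, in contrast to full-sequence diffusion (one shared level) or teacher forcing (no noise). The sketch below is not the authors' released code; it is a minimal numpy illustration of that corruption step, using the standard cosine schedule the setup row mentions (function names and shapes here are illustrative assumptions).

```python
import numpy as np

def cosine_alpha_bar(k, K, s=0.008):
    """Cumulative signal level under the cosine noise schedule
    (Nichol & Dhariwal): alpha_bar(k) = f(k/K) / f(0)."""
    f = lambda t: np.cos((t / K + s) / (1 + s) * np.pi / 2) ** 2
    return f(np.asarray(k, dtype=float)) / f(0.0)

def corrupt_per_token(x, rng, K=1000):
    """Diffusion Forcing-style corruption: draw an *independent*
    noise level k_t for every token of the sequence x (shape [T, D]),
    then apply the usual forward-diffusion mixing."""
    T = x.shape[0]
    k = rng.integers(0, K, size=T)            # one level per token
    abar = cosine_alpha_bar(k, K)[:, None]    # broadcast over feature dim
    eps = rng.standard_normal(x.shape)
    x_noisy = np.sqrt(abar) * x + np.sqrt(1.0 - abar) * eps
    return x_noisy, k, eps
```

A causal denoiser trained on `(x_noisy, k)` pairs can then, at sampling time, assign different noise levels along the sequence, which is what enables the guidance and stabilization tricks of Algorithm 2.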