Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Authors: Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate Diffusion Forcing's merits as a generative sequence model across diverse applications in video and time series prediction, planning, and imitation learning.
Researcher Affiliation | Academia | Boyuan Chen, MIT CSAIL, boyuanc@mit.edu; Diego Martí Monsó, Technical University of Munich, diego.marti@tum.de; Yilun Du, MIT CSAIL, yilundu@mit.edu; Max Simchowitz, MIT CSAIL, msimchow@mit.edu; Russ Tedrake, MIT CSAIL, russt@mit.edu; Vincent Sitzmann, MIT CSAIL, sitzmann@mit.edu
Pseudocode | Yes | Algorithm 1: Diffusion Forcing Training; Algorithm 2: DF Sampling with Guidance
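Algorithm 1's key departure from standard diffusion training is that every token in the sequence receives an independently sampled noise level, rather than one shared level for the whole sequence. A minimal NumPy sketch of that per-token training loss is below; the `model` callable and the linear alpha-bar schedule are illustrative placeholders, not the paper's architecture or exact schedule.

```python
import numpy as np

def diffusion_forcing_loss(model, x, T=1000, rng=np.random):
    """One training step in the spirit of Diffusion Forcing (Algorithm 1).

    x: (seq_len, dim) array of sequence tokens.
    model: callable (x_noisy, k) -> predicted noise; a stand-in here.
    """
    seq_len, _ = x.shape
    # Independent per-token noise levels -- the core idea of Diffusion Forcing.
    k = rng.randint(1, T + 1, size=seq_len)
    # Standard DDPM-style linear beta schedule (illustrative choice).
    alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
    a = alpha_bar[k - 1][:, None]                      # (seq_len, 1)
    eps = rng.standard_normal(x.shape)                 # per-token Gaussian noise
    x_noisy = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps  # forward process
    eps_hat = model(x_noisy, k)                        # denoiser prediction
    return np.mean((eps_hat - eps) ** 2)               # epsilon-prediction MSE
```

Because each token carries its own noise level, the same trained model can later be sampled causally (next-token style) or jointly (full-sequence style), which is what Algorithm 2 exploits.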
Open Source Code | Yes | Project website: https://boyuan.space/diffusion-forcing/; NeurIPS Paper Checklist — Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: Code has been released publicly.
Open Datasets | Yes | We train a convolutional RNN implementation of Causal Diffusion Forcing for video generative modeling on videos of Minecraft gameplay [68] and DMLab navigation [68].; We evaluate our proposed decision-making framework in a standard offline RL benchmark, D4RL [18].; We access the datasets from Gluon TS [2], and set the context and prediction windows to the same length for each dataset.
Dataset Splits | Yes | We construct a validation set of the same cardinality as the held-out test set as a randomly sampled subset of subsequences from the training set.
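The split described in the evidence — a validation set with the same cardinality as the test set, drawn as random subsequences of the training data — can be sketched as follows. The function name, the uniform sampling, and the fixed subsequence length are assumptions for illustration; the paper does not specify these details.

```python
import random

def make_validation_split(train_sequences, test_size, subseq_len, seed=0):
    """Sample a validation set of `test_size` subsequences, each of length
    `subseq_len`, uniformly at random from the training sequences.
    Illustrative sketch; not the paper's exact sampling procedure.
    """
    rng = random.Random(seed)
    val = []
    for _ in range(test_size):
        seq = rng.choice(train_sequences)
        start = rng.randrange(0, len(seq) - subseq_len + 1)
        val.append(seq[start:start + subseq_len])
    return val
```

Matching the validation-set cardinality to the test set keeps validation metrics comparable in variance to the reported test metrics.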
Hardware Specification | Yes | Time series, maze planning, compositionality, and visual imitation experiments can be trained with a single 2080 Ti with 11 GB of memory. We use 8 A100 GPUs for both video prediction datasets.
Software Dependencies | No | The paper mentions software like 'pytorch-ts' and 'Gluon TS [2]' but does not provide specific version numbers for these or other key software components or libraries.
Experiment Setup | Yes | We choose the number of channels in z to be 16 for DMLab and 32 for Minecraft.; We use sigmoid noise schedule [9] for video prediction, linear noise schedule for maze planning, and cosine schedule for everything else.; We train for 50K steps with a batch size of 8×16.
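The three noise schedules named in the setup evidence (sigmoid, linear, cosine) are standard diffusion schedules and can be written down explicitly. The sketches below use common parameterizations — the DDPM linear beta schedule, the Nichol & Dhariwal cosine schedule, and a sigmoid schedule in the style of [9] — with default hyperparameters that are assumptions, since the paper does not list them.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    # DDPM-style linear beta schedule; alpha_bar is the cumulative
    # product of (1 - beta_t). Endpoints here are the common defaults.
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    # Cosine schedule (Nichol & Dhariwal): alpha_bar follows a squared
    # cosine, normalized so alpha_bar(0) = 1.
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f[1:] / f[0]).clip(1e-9, 1.0)

def sigmoid_alpha_bar(T=1000, start=-3.0, end=3.0, tau=1.0):
    # Sigmoid schedule; this parameterization and its defaults are
    # illustrative, not necessarily the paper's exact choice.
    t = np.arange(1, T + 1) / T
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = sig(-(t * (end - start) + start) / tau)
    v0, v1 = sig(-start / tau), sig(-end / tau)
    return (v - v1) / (v0 - v1)  # rescaled to run from ~1 down to 0
```

All three return a monotonically decreasing alpha-bar curve in [0, 1]; the schedules differ in how quickly signal is destroyed at the start and end of the diffusion process, which is why different tasks in the setup use different ones.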