Generating Long Videos of Dynamic Scenes

Authors: Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, Tero Karras

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. We leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
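The two-phase training strategy described in the abstract (long clips at low resolution, then short clips at high resolution) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' code: `sample_clip`, `LowResGAN`/`SuperResNet`-style objects with `training_step`/`generate` methods, and the 8-frame high-resolution clip length are hypothetical placeholders; only the 128-frame, 64-resolution phase-1 clips are taken from the paper's stated setup.

```python
import torch

def sample_clip(num_frames: int, res: int) -> torch.Tensor:
    # Stand-in for loading a real training clip of shape (frames, 3, H, W).
    return torch.rand(num_frames, 3, res, res)

def two_phase_training(low_res_gan, super_res_net, steps: int = 1000):
    # Phase 1: long clips at low resolution teach long-term dynamics
    # (the paper trains its low-resolution generator on 128-frame clips).
    for _ in range(steps):
        long_clip = sample_clip(num_frames=128, res=64)
        low_res_gan.training_step(long_clip)          # hypothetical method

    # Phase 2: short clips at high resolution teach per-frame spatial detail.
    # The 8-frame clip length here is an illustrative assumption.
    for _ in range(steps):
        short_clip = sample_clip(num_frames=8, res=256)
        with torch.no_grad():
            low_res_frames = low_res_gan.generate(num_frames=8)   # hypothetical method
        super_res_net.training_step(low_res_frames, short_clip)   # hypothetical method
```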
Researcher Affiliation | Collaboration | Tim Brooks (NVIDIA, UC Berkeley); Janne Hellsten (NVIDIA); Miika Aittala (NVIDIA); Ting-Chun Wang (NVIDIA); Timo Aila (NVIDIA); Jaakko Lehtinen (NVIDIA, Aalto University); Ming-Yu Liu (NVIDIA); Alexei A. Efros (UC Berkeley); Tero Karras (NVIDIA)
Pseudocode | No | The paper presents architectural diagrams (e.g., Figure 2, Figure 3) but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | See our webpage for video results, code, data and pretrained models.
Open Datasets | Yes | To best evaluate our model, we introduce two new video datasets of first-person mountain biking and horseback riding (Figure 4a,b) that exhibit complex changes over time. Our new datasets include subject motion of the horse or biker, a first-person camera viewpoint that moves through space, and new scenery and objects over time. The videos are available in high definition and were manually trimmed to remove problematic segments, scene cuts, text overlays, obstructed views, etc. The mountain biking dataset has 1202 videos with a median duration of 330 frames at 30 fps, and the horseback dataset has 66 videos with a median duration of 6504 frames, also at 30 fps. We have permission from the content owners to publicly release the datasets for research purposes. We believe our new datasets will serve as important benchmarks for future work.
Dataset Splits | No | The paper mentions training with sequences of 128 frames and sampling from training data but does not specify explicit train/validation/test splits (e.g., percentages or counts) for the datasets themselves.
Hardware Specification | Yes | Our project consumed 300 MWh on an in-house cluster of V100 and A100 GPUs.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We employ DiffAug [69] using the same transformation for each frame in a sequence, as well as fractional time stretching between 1/2 and 2; see Appendix C.1 for details. ... We train at 256×144 resolution on these datasets to preserve the aspect ratio. ... We found it necessary to increase the R1 γ hyperparameter by 10× to produce good results with StyleGAN-V on our new datasets that exhibit complex changes over time. ... Our low-resolution generator ... is trained with sequences of 128 frames at 64² resolution.
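The sequence-level augmentations quoted above (one DiffAug transformation shared across all frames of a clip, plus fractional time stretching between 1/2 and 2) could look roughly like the sketch below. This is not the authors' implementation: the brightness jitter stands in for the actual DiffAug transforms, and the log-uniform stretch-factor sampling and linear time interpolation are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def augment_sequence(clip: torch.Tensor) -> torch.Tensor:
    # Sample ONE transformation and apply it identically to every frame,
    # so augmentation does not disturb temporal consistency within the clip.
    # A simple brightness jitter stands in for the real DiffAug transforms.
    brightness = (torch.rand(1) - 0.5) * 0.5
    return (clip + brightness).clamp(0.0, 1.0)

def time_stretch(clip: torch.Tensor, out_frames: int) -> torch.Tensor:
    # Fractional time stretching: resample the clip along the time axis by a
    # random factor in [1/2, 2] (log-uniform sampling is an assumption here),
    # then crop to the requested number of frames.
    factor = 2.0 ** (torch.rand(1).item() * 2.0 - 1.0)
    t, c, h, w = clip.shape
    new_t = max(out_frames, int(round(t * factor)))   # keep enough frames to crop
    flat = clip.permute(1, 2, 3, 0).reshape(1, c * h * w, t)            # (1, C*H*W, T)
    stretched = F.interpolate(flat, size=new_t, mode="linear", align_corners=False)
    stretched = stretched.reshape(c, h, w, new_t).permute(3, 0, 1, 2)   # (T', C, H, W)
    return stretched[:out_frames]

# Usage: sample a raw clip longer than the target length, augment, then stretch.
raw = torch.rand(256, 3, 64, 64)                      # (frames, channels, H, W)
clip = time_stretch(augment_sequence(raw), out_frames=128)
```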