SF-V: Single Forward Video Generation Model

Authors: Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23× speedup compared with SVD and 6× speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. In Sec. 4, we conduct comprehensive experiments supporting the main claims made in the abstract and introduction."
Researcher Affiliation | Collaboration | 1 Snap Inc., 2 Rutgers University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | "As we mentioned in the abstract, we plan to release the code and pre-trained models with sufficient instructions to faithfully reproduce our results."
Open Datasets | Yes | "All the experiments are conducted on an internal video dataset with around one million videos. ... We use the first frame from the UCF-101 dataset [62] as the conditioning input and generate 14-frame videos at a resolution of 1024×576 at 7 FPS for all methods."
Dataset Splits | No | The paper mentions using an "internal video dataset" for training and UCF-101 for evaluation, but it does not specify explicit train/validation/test splits, percentages, or sample counts for these datasets.
Hardware Specification | Yes | "The training is conducted for 50K iterations on 8 NVIDIA A100 GPUs, using the SM3 optimizer [59] with a learning rate of 1e-5 for the generator (i.e., UNet) and 1e-4 for the discriminator. ... We also report the latency of the denoising process for each setting, measured on a single NVIDIA A100 GPU."
Software Dependencies | No | The paper mentions using the "SM3 optimizer [59]" but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | "We fix the resolution of the training videos as 768×448 with the FPS as 7. The training is conducted for 50K iterations on 8 NVIDIA A100 GPUs, using the SM3 optimizer [59] with a learning rate of 1e-5 for the generator (i.e., UNet) and 1e-4 for the discriminator. We set the momentum and β for both optimizers as 0.5 and 0.999, respectively. The total batch size is set as 32 using 4-step gradient accumulation. We set the EMA rate as 0.95. We set Pmean = 1, Pstd = 1, and λ = 0.1 if not otherwise noted. At inference time, we sample videos at a resolution of 1024×576."
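For reference, the hyperparameters reported in the Experiment Setup excerpt can be collected into a single configuration sketch. The official code is not yet released, so all field names below are illustrative, not taken from the authors' implementation; only the values come from the paper's quoted settings.

```python
# Hedged sketch of SF-V's reported training setup; field names are
# hypothetical, values are transcribed from the paper's excerpt.
train_config = {
    "train_resolution": (768, 448),   # width x height of training videos
    "fps": 7,
    "iterations": 50_000,
    "num_gpus": 8,                    # NVIDIA A100
    "optimizer": "SM3",               # per citation [59] in the paper
    "lr_generator": 1e-5,             # UNet learning rate
    "lr_discriminator": 1e-4,
    "momentum": 0.5,                  # shared by both optimizers
    "beta": 0.999,
    "total_batch_size": 32,           # achieved via gradient accumulation
    "grad_accum_steps": 4,
    "ema_rate": 0.95,
    "P_mean": 1,                      # noise-level sampling parameters
    "P_std": 1,
    "lambda_weight": 0.1,             # loss weighting, unless noted otherwise
    "inference_resolution": (1024, 576),
}
```

A dictionary like this is only a bookkeeping aid for reproduction attempts; how these values map onto actual optimizer and dataloader objects depends on the unreleased training code.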