SF-V: Single Forward Video Generation Model
Authors: Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23× speedup compared with SVD and 6× speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. In Sec. 4, we conduct comprehensive experiments supporting the main claims made in the abstract and introduction. |
| Researcher Affiliation | Collaboration | 1Snap Inc. 2 Rutgers University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | As we mentioned in the abstract, we plan to release the code and pre-trained models with sufficient instructions to faithfully reproduce our results. |
| Open Datasets | Yes | All the experiments are conducted on an internal video dataset with around one million videos. ... We use the first frame from the UCF-101 dataset [62] as the conditioning input and generate 14-frame videos at a resolution of 1024×576 at 7 FPS for all methods. |
| Dataset Splits | No | The paper mentions using an 'internal video dataset' for training and UCF-101 for evaluation, but it does not specify explicit train/validation/test splits, percentages, or sample counts for these datasets. |
| Hardware Specification | Yes | The training is conducted for 50K iterations on 8 NVIDIA A100 GPUs, using the SM3 optimizer [59] with a learning rate of 1e-5 for the generator (i.e., UNet) and 1e-4 for the discriminator. ... We also report the latency of the denoising process for each setting, measured on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions using 'SM3 optimizer [59]' but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We fix the resolution of the training videos as 768×448 with the FPS as 7. The training is conducted for 50K iterations on 8 NVIDIA A100 GPUs, using the SM3 optimizer [59] with a learning rate of 1e-5 for the generator (i.e., UNet) and 1e-4 for the discriminator. We set the momentum and β for both optimizers as 0.5 and 0.999, respectively. The total batch size is set as 32 using a 4-step gradient accumulation. We set the EMA rate as 0.95. We set Pmean = 1, Pstd = 1, and λ = 0.1 if not otherwise noted. At inference time, we sample videos at a resolution of 1024×576. |
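The experiment-setup row above can be collected into a small configuration sketch for anyone attempting a reproduction. Field names are hypothetical (the authors' code is not yet released); values are taken directly from the quoted setup:

```python
# Hedged sketch: training/inference hyperparameters reported in the table
# above, gathered as a plain config dict. Key names are illustrative and
# not from the authors' (unreleased) code.
train_config = {
    "train_resolution": (768, 448),      # training video resolution (W, H)
    "fps": 7,
    "iterations": 50_000,
    "hardware": "8x NVIDIA A100",
    "optimizer": "SM3",                  # per the paper's reference [59]
    "lr_generator": 1e-5,                # UNet
    "lr_discriminator": 1e-4,
    "betas": (0.5, 0.999),               # momentum / beta for both optimizers
    "total_batch_size": 32,
    "grad_accum_steps": 4,
    "ema_rate": 0.95,
    "p_mean": 1.0,
    "p_std": 1.0,
    "lambda_weight": 0.1,
    "inference_resolution": (1024, 576),
}

# The per-step micro-batch implied by the accumulation setting:
micro_batch = train_config["total_batch_size"] // train_config["grad_accum_steps"]
print(micro_batch)  # → 8
```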