Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Authors: Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments indicate that FRIEREN achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline.
Researcher Affiliation Academia Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao, Zhejiang University, cyanbox@zju.edu.cn
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code No Due to time limitations, we have not yet organized a publicly shareable version of the code. We will open-source our code after the paper is accepted.
Open Datasets Yes Following most previous works, we take VGGSound [2] as the benchmark, which consists of 200k+ 10-second video clips from YouTube spanning 309 categories.
Dataset Splits No We follow the original train and test splits of VGGSound, the sizes of which are about 182.6k and 15.3k. The paper specifies train and test splits but does not provide explicit details for a validation split (e.g., percentages or sample counts).
Hardware Specification Yes Each model is trained with 2 NVIDIA RTX-4090 GPUs.
Software Dependencies No The paper mentions using specific models and solvers (e.g., 'BigVGAN [17] vocoder', 'DPM-Solver [24]'), but does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes The transformer of the vector field estimator mainly used in the experiments has 4 layers and a hidden dimension of 576. ... We train the estimator for 1.3M steps for the first training, and 600k and 500k steps for reflow and distillation, with the learning rate being 5e-5 for all stages. ... We set γ to 4.5 in our major experiments.
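The paper's training pipeline is not reproduced in this report, but the rectified flow matching objective its vector field estimator is trained with can be illustrated in a few lines. The sketch below is a hypothetical NumPy stand-in, not the authors' implementation: the random linear map `vector_field` substitutes for their 4-layer transformer (hidden dimension 576), and conditioning on time and video features is omitted. It only shows the core objective: regress the constant straight-line velocity `x1 - x0` along a linear interpolation path between noise and data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; 576 mirrors the paper's hidden dimension, the batch
# size is arbitrary. The real estimator is a 4-layer transformer.
DIM, BATCH = 576, 8

def vector_field(x_t, t, W):
    """Hypothetical stand-in for the learned vector field estimator.

    The actual model also conditions on the timestep t and on video
    features; this linear map ignores them for brevity.
    """
    return x_t @ W

def rectified_flow_loss(x1, W):
    """Rectified flow matching loss for one batch of data samples x1."""
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))  # per-example time in [0, 1)
    x_t = t * x1 + (1.0 - t) * x0           # straight-line interpolation
    target_v = x1 - x0                      # constant target velocity
    pred_v = vector_field(x_t, t, W)
    return np.mean((pred_v - target_v) ** 2)

x1 = rng.standard_normal((BATCH, DIM))      # stand-in for audio latents
W = 0.01 * rng.standard_normal((DIM, DIM))  # stand-in model parameters
loss = rectified_flow_loss(x1, W)
```

Because the target velocity is constant along each path, a well-trained estimator can be integrated with very few ODE steps at sampling time, which is the source of the efficiency claim in the title.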