Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Authors: Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments indicate that FRIEREN achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline. |
| Researcher Affiliation | Academia | Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao — Zhejiang University — cyanbox@zju.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | "Due to time limitations, we have not yet organized a publicly shareable version of the code. We will open-source our code after the paper is accepted." |
| Open Datasets | Yes | Following most previous works, we take VGGSound [2] as the benchmark, which consists of 200k+ 10-second video clips from YouTube spanning 309 categories. |
| Dataset Splits | No | We follow the original train and test splits of VGGSound, the sizes of which are about 182.6k and 15.3k. The paper specifies train and test splits but does not provide explicit details for a validation split (e.g., percentages or sample counts). |
| Hardware Specification | Yes | Each model is trained with 2 NVIDIA RTX-4090 GPUs. |
| Software Dependencies | No | The paper mentions using specific models and solvers (e.g., 'BigVGAN [17] vocoder', 'DPM-Solver [24]'), but does not provide version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The transformer of the vector field estimator mainly used in the experiments has 4 layers and a hidden dimension of 576. ... We train the estimator for 1.3M steps for the first training, and 600k and 500k steps for reflow and distillation, with the learning rate being 5e-5 for all stages. ... We set γ to 4.5 in our major experiments. |
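As a quick reference, the hyperparameters quoted in the Experiment Setup cell can be collected into a single configuration sketch. This is an illustrative summary only; the field names are not from the authors' codebase.

```python
# Hyperparameters reported in the Frieren paper's experiment setup.
# Field names are illustrative, chosen for readability.
FRIEREN_CONFIG = {
    "transformer_layers": 4,           # vector field estimator depth
    "hidden_dim": 576,                 # transformer hidden dimension
    "train_steps_initial": 1_300_000,  # first training stage
    "train_steps_reflow": 600_000,     # reflow stage
    "train_steps_distill": 500_000,    # distillation stage
    "learning_rate": 5e-5,             # same for all three stages
    "gamma": 4.5,                      # γ used in the major experiments
}

# Total optimizer steps across all three training stages.
total_steps = (FRIEREN_CONFIG["train_steps_initial"]
               + FRIEREN_CONFIG["train_steps_reflow"]
               + FRIEREN_CONFIG["train_steps_distill"])
print(total_steps)  # 2400000
```

Laid out this way, it is easy to see that reflow and distillation together add roughly 85% as many steps as the initial training stage.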