Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Authors: Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments indicate that FRIEREN achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline.
Researcher Affiliation Academia Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao, Zhejiang University, cyanbox@zju.edu.cn
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code No Due to time limitations, we have not yet organized a publicly shareable version of the code. We will open-source our code after the paper is accepted.
Open Datasets Yes Following most previous works, we take VGGSound [2] as the benchmark, which consists of 200k+ 10-second video clips from YouTube spanning 309 categories.
Dataset Splits No We follow the original train and test splits of VGGSound, the sizes of which are about 182.6k and 15.3k. The paper specifies train and test splits but does not provide explicit details for a validation split (e.g., percentages or sample counts).
Hardware Specification Yes Each model is trained with 2 NVIDIA RTX-4090 GPUs.
Software Dependencies No The paper mentions using specific models and solvers (e.g., 'BigVGAN [17] vocoder', 'DPM-Solver [24]'), but does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes The transformer of the vector field estimator mainly used in the experiments has 4 layers and a hidden dimension of 576. ... We train the estimator for 1.3M steps for the first training, and 600k and 500k steps for reflow and distillation, with the learning rate being 5e-5 for all stages. ... We set γ to 4.5 in our major experiments.
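The paper's training pipeline is not reproduced in this report, but the rectified flow matching objective its vector field estimator is trained with can be illustrated in a few lines. The sketch below is a hypothetical NumPy stand-in, not the authors' implementation: the random linear map `vector_field` substitutes for their 4-layer transformer (hidden dimension 576), and conditioning on time and video features is omitted. It only shows the core objective: regress the constant straight-line velocity `x1 - x0` along a linear interpolation path between noise and data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; 576 mirrors the paper's hidden dimension, the batch
# size is arbitrary. The real estimator is a 4-layer transformer.
DIM, BATCH = 576, 8

def vector_field(x_t, t, W):
    """Hypothetical stand-in for the learned vector field estimator.

    The actual model also conditions on the timestep t and on video
    features; this linear map ignores them for brevity.
    """
    return x_t @ W

def rectified_flow_loss(x1, W):
    """Rectified flow matching loss for one batch of data samples x1."""
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))  # per-example time in [0, 1)
    x_t = t * x1 + (1.0 - t) * x0           # straight-line interpolation
    target_v = x1 - x0                      # constant target velocity
    pred_v = vector_field(x_t, t, W)
    return np.mean((pred_v - target_v) ** 2)

x1 = rng.standard_normal((BATCH, DIM))      # stand-in for audio latents
W = 0.01 * rng.standard_normal((DIM, DIM))  # stand-in model parameters
loss = rectified_flow_loss(x1, W)
```

Because the target velocity is constant along each path, a well-trained estimator can be integrated with very few ODE steps at sampling time, which is the source of the efficiency claim in the title.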