Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Authors: Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DIFF-FOLEY achieves state-of-the-art V2A performance on the current large-scale V2A dataset. Furthermore, we demonstrate DIFF-FOLEY's practical applicability and adaptability via customized downstream finetuning. |
| Researcher Affiliation | Academia | Simian Luo1,2 Chuanhao Yan1 Chenxu Hu1 Hang Zhao1,2 1IIIS, Tsinghua University 2Shanghai Qi Zhi Institute |
| Pseudocode | No | The paper describes its methods in prose and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://diff-foley.github.io/ |
| Open Datasets | Yes | We use two datasets VGGSound [2] and Audio Set [10]. |
| Dataset Splits | No | The paper only states "We follow the original VGGSound train/test splits" and does not report explicit split sizes or a validation split. |
| Hardware Specification | Yes | We train the CAVP model for 1.4M steps on 8 A100 GPUs, with a total batch size of 720 using automatic mixed-precision training (AMP). |
| Software Dependencies | No | The paper mentions specific models and optimizers (e.g., "SD-V1.4", "Adam W") but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For CAVP, we use a pretrained audio encoder from PANNs [24] and a pretrained SlowOnly [9] based video encoder. For training, we randomly extract 4-second audio-video pairs from 10-second samples, resulting in x_a ∈ ℝ^{256×128} and x_v ∈ ℝ^{16×3×224×224}. We use temporal contrast L_T with N_T = 3 and a minimum time difference of 2 seconds between each pair. For LDM training, we utilize the pretrained Stable Diffusion-V1.4 (SD-V1.4) [38] as a powerful denoising prior model. ... We train the CAVP model for 1.4M steps on 8 A100 GPUs, with a total batch size of 720 using automatic mixed-precision training (AMP). We used the AdamW [29] optimizer with a learning rate of 8e-4 and 200 steps of warmup. |
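The temporal-contrast sampling described in the setup row (N_T = 3 pairs of 4-second windows drawn from 10-second clips, at least 2 seconds apart) can be sketched as follows. This is a minimal illustration assuming a simple rejection-sampling strategy; the clip length, window length, N_T, and minimum gap come from the paper, but the sampling procedure itself is an assumption, not the authors' implementation:

```python
import random


def sample_window_starts(clip_len=10.0, window=4.0, n_windows=3,
                         min_gap=2.0, rng=None, max_tries=1000):
    """Sample n_windows start times for fixed-length windows inside a clip,
    rejecting draws until every pair of starts is at least min_gap apart.

    clip_len=10.0, window=4.0, n_windows=3 (N_T), and min_gap=2.0 follow
    the paper; rejection sampling is an assumed, illustrative choice.
    """
    rng = rng or random.Random()
    latest_start = clip_len - window  # 6.0 s for a 4 s window in a 10 s clip
    for _ in range(max_tries):
        starts = sorted(rng.uniform(0.0, latest_start)
                        for _ in range(n_windows))
        # Starts are sorted, so checking consecutive gaps covers all pairs.
        if all(b - a >= min_gap for a, b in zip(starts, starts[1:])):
            return starts
    raise RuntimeError("could not satisfy the minimum-gap constraint")


starts = sample_window_starts(rng=random.Random(0))
```

Each returned start time defines one 4-second audio-video window, so the three windows from one clip overlap in content but differ in position, which is what the temporal contrastive loss L_T discriminates between.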