Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Authors: Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DIFF-FOLEY achieves state-of-the-art V2A performance on current large scale V2A dataset. Furthermore, we demonstrate DIFF-FOLEY practical applicability and adaptability via customized downstream finetuning.
Researcher Affiliation | Academia | Simian Luo (1,2), Chuanhao Yan (1), Chenxu Hu (1), Hang Zhao (1,2). (1) IIIS, Tsinghua University; (2) Shanghai Qi Zhi Institute
Pseudocode | No | The paper describes its methods in prose and equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: https://diff-foley.github.io/
Open Datasets | Yes | We use two datasets VGGSound [2] and AudioSet [10].
Dataset Splits | No | We follow the original VGGSound train/test splits.
Hardware Specification | Yes | We train the CAVP model for 1.4M steps on 8 A100 GPUs, with a total batch size of 720 using automatic mixed-precision training (AMP).
Software Dependencies | No | The paper mentions specific models and optimizers (e.g., "SD-V1.4", "AdamW") but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | For CAVP, we use a pretrained audio encoder from PANNs [24] and a pretrained SlowOnly [9] based video encoder. For training, we randomly extract 4-second audio-video frames pairs from 10-second samples, resulting in $x_a \in \mathbb{R}^{256 \times 128}$ and $x_v \in \mathbb{R}^{16 \times 3 \times 224 \times 224}$. We use temporal contrast $L_T$ with $N_T = 3$ and a minimum time difference of 2 seconds between each pair. For LDM training, we utilize the pretrained Stable Diffusion-V1.4 (SD-V1.4) [38] as a powerful denoising prior model. ... We train the CAVP model for 1.4M steps on 8 A100 GPUs, with a total batch size of 720 using automatic mixed-precision training (AMP). We used the AdamW [29] optimizer with a learning rate of 8e-4 and 200 steps of warmup.
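
The CAVP training numbers quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `CavpTrainConfig`, `per_gpu_batch_size`, `warmup_lr`, and `sample_clip_starts` are hypothetical. Three things are assumptions rather than facts from the paper: that the batch of 720 is split evenly across the 8 GPUs, that the 200 warmup steps are linear, and that the 2-second minimum difference applies to clip start times.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CavpTrainConfig:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    sample_seconds: float = 10.0   # length of each source sample
    clip_seconds: float = 4.0      # length of each extracted audio-video clip
    num_temporal_clips: int = 3    # N_T in the temporal contrast term
    min_gap_seconds: float = 2.0   # minimum time difference between clips
    total_batch_size: int = 720
    num_gpus: int = 8
    base_lr: float = 8e-4
    warmup_steps: int = 200


def per_gpu_batch_size(cfg: CavpTrainConfig) -> int:
    """Assumes an even data-parallel split of the total batch."""
    if cfg.total_batch_size % cfg.num_gpus != 0:
        raise ValueError("total batch size is not divisible by the GPU count")
    return cfg.total_batch_size // cfg.num_gpus


def warmup_lr(step: int, cfg: CavpTrainConfig) -> float:
    """Linear warmup (assumed shape) to base_lr, then constant."""
    scale = min(1.0, (step + 1) / cfg.warmup_steps)
    return cfg.base_lr * scale


def sample_clip_starts(cfg: CavpTrainConfig, rng: random.Random) -> list[float]:
    """Draw N_T sorted clip start times whose pairwise gaps are >= min_gap_seconds.

    Samples offsets inside the leftover slack, sorts them, then spreads them
    apart by the minimum gap, so no rejection loop is needed.
    """
    max_start = cfg.sample_seconds - cfg.clip_seconds
    slack = max_start - (cfg.num_temporal_clips - 1) * cfg.min_gap_seconds
    if slack < 0:
        raise ValueError("clips with this gap do not fit in the sample")
    offsets = sorted(rng.uniform(0.0, slack) for _ in range(cfg.num_temporal_clips))
    return [offset + i * cfg.min_gap_seconds for i, offset in enumerate(offsets)]
```

Under the even-split assumption, the defaults give 90 samples per GPU, and every 4-second clip starts somewhere in the first 6 seconds of its 10-second sample.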