Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Authors: Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DIFF-FOLEY achieves state-of-the-art V2A performance on the current large-scale V2A dataset. Furthermore, we demonstrate DIFF-FOLEY's practical applicability and adaptability via customized downstream finetuning. |
| Researcher Affiliation | Academia | Simian Luo1,2 Chuanhao Yan1 Chenxu Hu1 Hang Zhao1,2 1IIIS, Tsinghua University 2Shanghai Qi Zhi Institute |
| Pseudocode | No | The paper describes its methods in prose and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://diff-foley.github.io/ |
| Open Datasets | Yes | We use two datasets VGGSound [2] and Audio Set [10]. |
| Dataset Splits | No | The paper only states "We follow the original VGGSound train/test splits" and does not report explicit split sizes or a validation split. |
| Hardware Specification | Yes | We train the CAVP model for 1.4M steps on 8 A100 GPUs, with a total batch size of 720 using automatic mixed-precision training (AMP). |
| Software Dependencies | No | The paper mentions specific models and optimizers (e.g., "SD-V1.4", "Adam W") but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For CAVP, we use a pretrained audio encoder from PANNs [24] and a pretrained SlowOnly [9] based video encoder. For training, we randomly extract 4-second audio-video pairs from 10-second samples, resulting in x_a ∈ ℝ^{256×128} and x_v ∈ ℝ^{16×3×224×224}. We use temporal contrast L_T with N_T = 3 and a minimum time difference of 2 seconds between each pair. For LDM training, we utilize the pretrained Stable Diffusion-V1.4 (SD-V1.4) [38] as a powerful denoising prior model. ... We train the CAVP model for 1.4M steps on 8 A100 GPUs, with a total batch size of 720 using automatic mixed-precision training (AMP). We used the AdamW [29] optimizer with a learning rate of 8e-4 and 200 steps of warmup. |
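The temporal-contrast sampling described in the setup row (N_T = 3 pairs of 4-second windows drawn from 10-second clips, at least 2 seconds apart) can be sketched as follows. This is a minimal illustration assuming a simple rejection-sampling strategy; the clip length, window length, N_T, and minimum gap come from the paper, but the sampling procedure itself is an assumption, not the authors' implementation:

```python
import random


def sample_window_starts(clip_len=10.0, window=4.0, n_windows=3,
                         min_gap=2.0, rng=None, max_tries=1000):
    """Sample n_windows start times for fixed-length windows inside a clip,
    rejecting draws until every pair of starts is at least min_gap apart.

    clip_len=10.0, window=4.0, n_windows=3 (N_T), and min_gap=2.0 follow
    the paper; rejection sampling is an assumed, illustrative choice.
    """
    rng = rng or random.Random()
    latest_start = clip_len - window  # 6.0 s for a 4 s window in a 10 s clip
    for _ in range(max_tries):
        starts = sorted(rng.uniform(0.0, latest_start)
                        for _ in range(n_windows))
        # Starts are sorted, so checking consecutive gaps covers all pairs.
        if all(b - a >= min_gap for a, b in zip(starts, starts[1:])):
            return starts
    raise RuntimeError("could not satisfy the minimum-gap constraint")


starts = sample_window_starts(rng=random.Random(0))
```

Each returned start time defines one 4-second audio-video window, so the three windows from one clip overlap in content but differ in position, which is what the temporal contrastive loss L_T discriminates between.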