V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively. |
| Researcher Affiliation | Collaboration | Heng Wang¹*, Jianbo Ma², Santiago Pascual², Richard Cartwright², Weidong Cai¹ (¹University of Sydney, ²Dolby Laboratories); {heng.wang, tom.cai}@sydney.edu.au, {jianbo.ma, santiago.pascual, richard.cartwright}@dolby.com |
| Pseudocode | No | The paper does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Supplementary materials such as audio samples are provided at our demo website: https://v2a-mapper.github.io/. This website contains audio samples but does not explicitly provide the source code for the methodology. |
| Open Datasets | Yes | We train our V2A-Mapper and all the variants on VGGSound video dataset (Chen et al. 2020a). ... To testify the generalization ability of our V2A-Mapper, we also test on out-of-distribution dataset Image Hear (Sheffer and Adi 2023). |
| Dataset Splits | No | The paper mentions 'Following the original train/test splits, we train on 183,730 videos and evaluate on 15,446 videos.' but does not explicitly mention a separate validation split or its size. |
| Hardware Specification | Yes | The inference time is measured as the average time spent for 100 samples through the whole pipeline from input visual prompts to output waveforms on one NVIDIA RTX A6000 GPU. A sketch of this timing protocol appears after the table. |
| Software Dependencies | No | The paper mentions 'We use ViT-B/32 version for CLIP model' and 'For CLAP model and audio generator, we use pretrained models from AudioLDM' but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). An embedding-extraction sketch based on this detail appears after the table. |
| Experiment Setup | Yes | For the diffusion-based V2A-Mapper, we use a cosine noise schedule with 1000 diffusion steps during training and 200 steps at inference time. We use AdamW with a learning rate of 1.1e-4, a batch size of 448 visual-audio embedding pairs, and a dropout rate of 0.1 in classifier-free guidance. A hedged training-configuration sketch based on these values appears after the table. |
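
The timing protocol in the Hardware Specification row (average over 100 samples, full pipeline from visual prompt to waveform) can be reproduced with a small harness. The sketch below is an assumption-laden illustration: `generate_audio` is a hypothetical end-to-end callable standing in for the unreleased pipeline, not an API from the paper.

```python
# Minimal sketch of the timing protocol in the "Hardware Specification" row.
# `generate_audio` is a hypothetical callable (visual prompt in, waveform out).
import time
import torch

def average_inference_time(generate_audio, prompts, device: str = "cuda") -> float:
    assert len(prompts) == 100, "the paper averages over 100 samples"
    if device == "cuda":
        torch.cuda.synchronize()          # flush queued GPU work before timing
    start = time.perf_counter()
    for prompt in prompts:
        generate_audio(prompt)            # whole pipeline: visual prompt -> waveform
    if device == "cuda":
        torch.cuda.synchronize()          # make sure all GPU work finished
    return (time.perf_counter() - start) / len(prompts)
```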
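
Since the Software Dependencies row pins the vision encoder only as CLIP ViT-B/32, the minimal sketch below shows one way the conditioning image embedding could be extracted, assuming OpenAI's `clip` package. The file path `frame.jpg` is a placeholder, and the CLAP / AudioLDM side is omitted because its packaging and version are not specified in the paper.

```python
# Hedged sketch: extracting a CLIP ViT-B/32 visual embedding to condition the V2A-Mapper.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # ViT-B/32, as stated in the paper

image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    visual_emb = model.encode_image(image)                 # shape (1, 512)
    visual_emb = visual_emb / visual_emb.norm(dim=-1, keepdim=True)  # unit-normalize
```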
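
The hyperparameters quoted in the Experiment Setup row translate into a training loop roughly like the following. Only the stated values (cosine schedule, 1000 training / 200 inference steps, AdamW at 1.1e-4, batches of 448 embedding pairs, 0.1 classifier-free-guidance dropout) come from the paper; the toy mapper architecture, the 512-d embedding size, the additive conditioning, and the noise-prediction objective are assumptions made only for this sketch.

```python
# Hedged sketch of the training configuration in the "Experiment Setup" row.
import math
import torch

TRAIN_STEPS = 1000   # diffusion steps during training (from the paper)
INFER_STEPS = 200    # diffusion steps at inference time (from the paper, unused in this sketch)
BATCH_SIZE = 448     # visual-audio embedding pairs per batch (from the paper)
CFG_DROPOUT = 0.1    # probability of dropping the visual condition (classifier-free guidance)
EMB_DIM = 512        # assumed embedding size (CLIP ViT-B/32 outputs 512-d vectors)

def cosine_beta_schedule(timesteps: int, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule; the paper names the schedule but not its exact parameterization."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    alphas_cumprod = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1.0 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return betas.clamp(max=0.999).float()

betas = cosine_beta_schedule(TRAIN_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Toy stand-in for the V2A-Mapper network; the real architecture is not reproduced here.
mapper = torch.nn.Sequential(
    torch.nn.Linear(EMB_DIM, EMB_DIM), torch.nn.GELU(), torch.nn.Linear(EMB_DIM, EMB_DIM)
)
optimizer = torch.optim.AdamW(mapper.parameters(), lr=1.1e-4)

def train_step(clip_emb: torch.Tensor, clap_emb: torch.Tensor) -> float:
    """One denoising step on a (BATCH_SIZE, EMB_DIM) batch of visual/audio embedding pairs."""
    t = torch.randint(0, TRAIN_STEPS, (clap_emb.size(0),))
    noise = torch.randn_like(clap_emb)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * clap_emb + (1.0 - a_bar).sqrt() * noise
    # Classifier-free guidance: drop the visual condition with probability CFG_DROPOUT.
    keep = (torch.rand(clip_emb.size(0), 1) > CFG_DROPOUT).float()
    pred = mapper(noisy + clip_emb * keep)   # additive conditioning is a simplification
    loss = torch.nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```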