Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios.
Researcher Affiliation | Collaboration | Haoyi Duan¹, Yan Xia¹, Mingze Zhou¹, Li Tang¹, Jieming Zhu³, Zhou Zhao¹,²; ¹Zhejiang University, ²Shanghai Artificial Intelligence Laboratory, ³Huawei Noah's Ark Lab
Pseudocode | No | The paper describes the DG-SCT module and its operations with equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
Open Datasets | Yes | We evaluate the AVE task on the AVE dataset [33], which originates from AudioSet [9] and contains 4,143 videos covering 28 categories. We evaluate the AVVP task on the LLP dataset [32], the AVS task on the AVSBench dataset [43], and the AVQA task on the MUSIC-AVQA dataset [14]. The pre-training data for our zero-shot task is VGG-Sound(40K), which is split from the VGG-Sound dataset [2].
Dataset Splits | No | The paper mentions test sets and model training, but it does not specify the proportions, counts, or naming of a validation split for the datasets used in the main experiments. For few-shot learning it states: "For the few-shot setting, we train the model by selecting shot samples of each category from the training dataset of the downstream tasks," but it does not describe how a separate validation set was created or used for hyperparameter tuning or early stopping (a sketch of this sampling protocol follows the table).
Hardware Specification | Yes | All of our experiments are trained on one NVIDIA A100 GPU.
Software Dependencies | No | The acknowledgements mention MindSpore: "We also gratefully acknowledge the support of MindSpore (https://www.mindspore.cn), which is a new deep learning computing framework." However, the paper does not give a version number for MindSpore or for any other software library, framework, or programming language used.
Experiment Setup | Yes | For the AVE and AVQA tasks, we set α = 0.3, β = 0.05, and γ = 0.1, and train the model with a batch size of 8 and learning rates of 5×10⁻⁴ and 1×10⁻⁴, respectively. For the AVVP task, we set α = 0.3, β = 0.05, and γ = 0.05, and train with a batch size of 8 and a learning rate of 3×10⁻⁴. For the S4 setting of the AVS task, we set α = 0.3, β = 0.05, and γ = 0.05, and train with a batch size of 8 and a learning rate of 3×10⁻⁴. For the MS3 setting of the AVS task, we set α = 0.2, β = 0.1, and γ = 0.1, and train with a batch size of 2 and a learning rate of 1.5×10⁻⁴. For few-shot/zero-shot tasks, we set the learning rate to 3×10⁻⁴ with a batch size of 2. (These values are collected into a configuration sketch after the table.)