Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. |
| Researcher Affiliation | Collaboration | Haoyi Duan¹, Yan Xia¹, Mingze Zhou¹, Li Tang¹, Jieming Zhu³, Zhou Zhao¹,² (¹Zhejiang University, ²Shanghai Artificial Intelligence Laboratory, ³Huawei Noah's Ark Lab) |
| Pseudocode | No | The paper describes the DG-SCT module and its operations with equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks. (An illustrative, non-paper sketch follows the table.) |
| Open Source Code | Yes | The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT. |
| Open Datasets | Yes | We evaluate the AVE task on the AVE dataset [33] originating from the Audio Set [9]. The AVE dataset contains 4,143 videos covering 28 categories. We evaluate the AVVP task on the LLP dataset [32]... We evaluate the AVS task on the AVSBench dataset [43]... We conduct our experiments of the AVQA task on the MUSIC-AVQA dataset [14]... The pre-training data of our zero-shot task is VGG-Sound(40K), which is split from the VGG-Sound dataset [2]. |
| Dataset Splits | No | The paper mentions test sets and training, but it does not specify the proportions, counts, or naming of a validation split for the datasets used in the main experiments. For few-shot learning it states: "For the few-shot setting, we train the model by selecting shot samples of each category from the training dataset of the downstream tasks," but it does not describe how a validation set was created or used for hyperparameter tuning or early stopping. (A sampling sketch follows the table.) |
| Hardware Specification | Yes | All of our experiments are trained on one NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions "MindSpore" in the acknowledgements: "We also gratefully acknowledge the support of MindSpore (https://www.mindspore.cn), which is a new deep learning computing framework." However, it does not specify a version number for MindSpore or any other software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For the AVE and the AVQA tasks, we set α = 0.3, β = 0.05, and γ = 0.1, and train the model with a batch size of 8 and learning rates of 5×10⁻⁴ and 1×10⁻⁴, respectively. For the AVVP task, we set α = 0.3, β = 0.05, and γ = 0.05, and train the model with a batch size of 8 and a learning rate of 3×10⁻⁴. For the S4 setting of the AVS task, we set α = 0.3, β = 0.05, and γ = 0.05, and train the model with a batch size of 8 and a learning rate of 3×10⁻⁴. For the MS3 setting of the AVS task, we set α = 0.2, β = 0.1, and γ = 0.1, and train the model with a batch size of 2 and a learning rate of 1.5×10⁻⁴. For few-shot/zero-shot tasks, we set the learning rate to 3×10⁻⁴ with a batch size of 2. (These settings are collected in the config sketch after the table.) |
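Since the paper provides no pseudocode for DG-SCT, here is a rough orientation only: a generic sketch of audio-guided channel attention in the spirit of cross-modal prompting. It is not the paper's formulation; the class, dimensions, and parameter names are all hypothetical.

```python
import torch
import torch.nn as nn

class AudioGuidedChannelAttention(nn.Module):
    """Illustrative cross-modal gate: audio features re-weight visual channels.

    A generic squeeze-excitation-style sketch, NOT the DG-SCT equations
    from the paper; all dimensions and names are assumptions.
    """
    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        # Map a pooled audio embedding to one gate value per visual channel.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map from a frozen visual encoder
        # audio:  (B, D) pooled embedding from a frozen audio encoder
        weights = self.gate(audio)                 # (B, C), each in (0, 1)
        return visual * weights[:, :, None, None]  # re-weight visual channels
```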
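The few-shot setting quoted in the Dataset Splits row ("selecting shot samples of each category from the training dataset") amounts to per-class subsampling. A minimal sketch, assuming the dataset is an iterable of (sample, label) pairs; the function name and signature are illustrative, not from the released code.

```python
import random
from collections import defaultdict

def select_k_shot(dataset, k, seed=0):
    """Pick k samples per category, as in the paper's few-shot setting.

    `dataset` is assumed to be an iterable of (sample, label) pairs;
    this helper and its signature are hypothetical.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in dataset:
        by_label[label].append(sample)
    subset = []
    for label, samples in by_label.items():
        chosen = rng.sample(samples, min(k, len(samples)))
        subset.extend((s, label) for s in chosen)
    rng.shuffle(subset)  # avoid grouping by class in the training order
    return subset
```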
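For quick reference, the per-task hyperparameters from the Experiment Setup row can be collected in one place. The dictionary layout and key names below are illustrative, not taken from the released code; the values are transcribed from the paper.

```python
# Per-task hyperparameters transcribed from the paper's experiment setup.
# Dictionary layout and key names are assumptions, not from the repo.
CONFIGS = {
    "AVE":     {"alpha": 0.3, "beta": 0.05, "gamma": 0.10, "batch_size": 8, "lr": 5e-4},
    "AVQA":    {"alpha": 0.3, "beta": 0.05, "gamma": 0.10, "batch_size": 8, "lr": 1e-4},
    "AVVP":    {"alpha": 0.3, "beta": 0.05, "gamma": 0.05, "batch_size": 8, "lr": 3e-4},
    "AVS-S4":  {"alpha": 0.3, "beta": 0.05, "gamma": 0.05, "batch_size": 8, "lr": 3e-4},
    "AVS-MS3": {"alpha": 0.2, "beta": 0.10, "gamma": 0.10, "batch_size": 2, "lr": 1.5e-4},
    # Few-shot/zero-shot runs share one setting in the paper:
    "few_zero_shot": {"batch_size": 2, "lr": 3e-4},
}
```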