Mixtures of Experts for Audio-Visual Learning
Authors: Ying Cheng, Yang Li, Junjie He, Rui Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our proposed approach AVMoE achieves superior performance across multiple audio-visual tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results also indicate that our approach can tackle challenging scenes where modality information is missing. |
| Researcher Affiliation | Academia | Ying Cheng (1,2,3), Yang Li (1,2), Junjie He (1,2), Rui Feng (1,2,3); 1 School of Computer Science, Fudan University; 2 Shanghai Key Laboratory of Intelligent Information Processing; 3 Shanghai Collaborative Innovation Center of Intelligent Visual Computing |
| Pseudocode | No | No explicit pseudocode or algorithm blocks found. |
| Open Source Code | Yes | The source code is available at https://github.com/yingchengy/AVMOE. |
| Open Datasets | Yes | For the AVE task, we adopt the overall segment-wise accuracy of predicted event categories as the evaluation metric, and the event label of each video segment is required to be predicted in a fully-supervised manner. As shown in Table 1, we compare our proposed AVMoE with previous methods on the test set of the AVE dataset. We focus on LAVISH [30] and DG-SCT [11], as they also explore parameter-efficient fine-tuning methods like ours and have achieved impressive results on this benchmark. ... For this task, we use the Look, Listen, and Parse (LLP) dataset [53], which contains 11,849 videos from different domains, covering 25 categories. LLP is a semi-supervised annotation dataset, and each video has video-level event annotations. Only 1,849 randomly selected videos have second-by-second annotations for audio and visual events. ... We conduct experiments on the AVSBench [65], a dataset containing both single-source segmentation and multiple-source segmentation. The single-source dataset contains 4,932 semi-supervised videos. During the training process, only the first sample frame of the video is fully labeled, but during evaluation, all video frames need to be predicted. The multi-source dataset contains 424 fully supervised videos with every frame labeled. ... We conduct experiments on the MUSIC-AVQA dataset [27], which contains more than 45K question-answer pairs covering 33 different question templates. |
| Dataset Splits | No | No explicit statement about validation dataset splits found. |
| Hardware Specification | Yes | For these audio-visual downstream tasks, all experiments are conducted on 8x NVIDIA 3090 (24G) GPUs, and the batch size on a single GPU varies depending on the parameters of the models. |
| Software Dependencies | Yes | For all of our experiments, we utilize the Adam [21] optimizer to train our models and set the scheduler to make the learning rate decay to 0.35 times its original value after every 3 epochs. ... For audio pre-processing, we compute audio spectrograms via the PyTorch [45] kaldi fbank with 192 triangular mel-frequency bins and set the frame shift to 5.2 ms. (A preprocessing and scheduler sketch is given below the table.) |
| Experiment Setup | Yes | For AVE and AVVP tasks, 32 latent tokens and a downsampling factor of 8 are used in the AVMoE adapter. For the AVS task on AVSBench and the AVQA task on MUSIC-AVQA, we use two latent tokens and set the downsampling factor and the number of group convolutions to 8 and 4, respectively. For all of our experiments, we utilize the Adam [21] optimizer to train our models and set the scheduler to make the learning rate decay to 0.35 times its original value after every 3 epochs. We set the learning rate of the AVMoE adapter to 5e-4, while the learning rates of the final prediction layer are the same as previous work [11]: 5e-6 for AVE, 3e-4 for AVVP, 3e-4 for the S4 setting of AVS, 1.5e-4 for the MS3 setting of AVS, and 1e-4 for AVQA. (A per-task learning-rate configuration sketch follows the preprocessing sketch below.) |
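
The quoted optimizer, scheduler, and audio pre-processing settings can be pieced together into a short PyTorch sketch. This is not the authors' released code: the model, sample rate, and training loop below are placeholders, and only the fbank parameters (192 mel bins, 5.2 ms frame shift) and the StepLR-style decay (0.35x every 3 epochs) come from the paper.

```python
import torch
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    # Kaldi-style log-mel filterbank features matching the paper's audio
    # pre-processing: 192 triangular mel bins, 5.2 ms frame shift.
    return kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        num_mel_bins=192,
        frame_shift=5.2,  # milliseconds
    )

features = extract_fbank(torch.randn(1, 16000))  # -> (num_frames, 192)

model = torch.nn.Linear(192, 10)  # placeholder for the actual AVMoE model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Decay the learning rate to 0.35 times its value after every 3 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.35)

for epoch in range(12):
    # ... one training epoch over the audio-visual dataloader ...
    optimizer.step()
    scheduler.step()
```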
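
Similarly, the two learning rates reported for the AVE setting (5e-4 for the AVMoE adapter, 5e-6 for the final prediction layer) map naturally onto PyTorch optimizer parameter groups. The module names and layer sizes below are hypothetical stand-ins, not the released architecture.

```python
import torch

class AVEModel(torch.nn.Module):
    # Hypothetical stand-in: `adapter` represents the AVMoE adapter parameters,
    # `head` the final event prediction layer for the AVE task.
    def __init__(self) -> None:
        super().__init__()
        self.adapter = torch.nn.Linear(512, 512)
        self.head = torch.nn.Linear(512, 28)

model = AVEModel()
optimizer = torch.optim.Adam([
    {"params": model.adapter.parameters(), "lr": 5e-4},  # AVMoE adapter
    {"params": model.head.parameters(), "lr": 5e-6},     # AVE prediction layer
])
```

Swapping the head group's learning rate for 3e-4, 1.5e-4, or 1e-4 would match the other per-task values quoted in the table.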