Mixtures of Experts for Audio-Visual Learning

Authors: Ying Cheng, Yang Li, Junjie He, Rui Feng

Venue: NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our proposed approach AVMoE achieves superior performance across multiple audio-visual tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results also indicate that our approach can tackle challenging scenes where modality information is missing.
Researcher Affiliation | Academia | Ying Cheng (1,2,3), Yang Li (1,2), Junjie He (1,2), Rui Feng (1,2,3); 1 School of Computer Science, Fudan University; 2 Shanghai Key Laboratory of Intelligent Information Processing; 3 Shanghai Collaborative Innovation Center of Intelligent Visual Computing
Pseudocode | No | No explicit pseudocode or algorithm blocks found.
Open Source Code | Yes | The source code is available at https://github.com/yingchengy/AVMOE.
Open Datasets | Yes | For the AVE task, we adopt the overall segment-wise accuracy of predicted event categories as the evaluation metric, and the event label of each video segment is required to be predicted in a fully supervised manner. As shown in Table 1, we compare our proposed AVMoE with previous methods on the test set of the AVE dataset. We focus on LAVisH [30] and DG-SCT [11], as they also explore parameter-efficient fine-tuning methods like ours and have achieved impressive results on this benchmark. ... For this task, we use the Look, Listen, and Parse (LLP) dataset [53], which contains 11,849 videos from different domains, covering 25 categories. LLP is a semi-supervised annotation dataset, and each video has video-level event annotations. Only 1,849 randomly selected videos have second-by-second annotations for audio and visual events. ... We conduct experiments on AVSBench [65], a dataset containing both single-source and multiple-source segmentation. The single-source dataset contains 4,932 semi-supervised videos. During training, only the first sampled frame of each video is fully labeled, but during evaluation, all video frames need to be predicted. The multiple-source dataset contains 424 fully supervised videos with every frame labeled. ... We conduct experiments on the MUSIC-AVQA dataset [27], which contains more than 45K question-answer pairs covering 33 different question templates.
Dataset Splits | No | No explicit statement about validation dataset splits found.
Hardware Specification | Yes | For these audio-visual downstream tasks, all experiments are conducted on 8x NVIDIA 3090 (24G) GPUs, and the batch size on a single GPU varies depending on the parameters of the models.
Software Dependencies | Yes | For all of our experiments, we utilize the Adam [21] optimizer to train our models and set the scheduler to make the learning rate decay to 0.35 times its original value after every 3 epochs. ... For audio pre-processing, we compute audio spectrograms via the PyTorch [45] kaldi fbank with 192 triangular mel-frequency bins and set the frame shift to 5.2 ms.
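
The quoted pre-processing and schedule map onto standard PyTorch/torchaudio calls. The following is a minimal sketch under that assumption (torchaudio's Kaldi-compatible fbank, a StepLR schedule); the input file name and placeholder module are invented for illustration and are not taken from the released code.

    # Sketch of the reported audio pre-processing and learning-rate schedule.
    import torch
    import torchaudio

    # Log-mel spectrogram: 192 triangular mel bins, ~5.2 ms frame shift.
    waveform, sample_rate = torchaudio.load("clip.wav")  # hypothetical input clip
    waveform = waveform[:1]  # use the first channel (fbank expects mono)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        num_mel_bins=192,
        frame_shift=5.2,
    )

    # Adam optimizer whose learning rate decays to 0.35x every 3 epochs.
    model = torch.nn.Linear(192, 10)  # placeholder standing in for the trainable modules
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.35)

    for epoch in range(10):
        # ... training loop over batches would go here ...
        scheduler.step()  # applied once per epoch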
Experiment Setup | Yes | For the AVE and AVVP tasks, 32 latent tokens and a downsampling factor of 8 are used in the AVMoE adapter. For the AVS task on AVSBench and the AVQA task on MUSIC-AVQA, we use two latent tokens and set the downsampling factor and the number of group convolutions to 8 and 4, respectively. For all of our experiments, we utilize the Adam [21] optimizer to train our models and set the scheduler to make the learning rate decay to 0.35 times its original value after every 3 epochs. We set the learning rate of the AVMoE adapter to 5e-4, while the learning rates of the final prediction layer follow previous work [11]: 5e-6 for AVE, 3e-4 for AVVP, 3e-4 for the S4 setting of AVS, 1.5e-4 for the MS3 setting of AVS, and 1e-4 for AVQA.
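
A hedged sketch of how these per-task hyperparameters could be organized: the adapter keeps its 5e-4 learning rate while the prediction head uses the task-specific rate quoted above, which corresponds to two Adam parameter groups. The adapter and prediction_head module names and the config layout are assumptions made for illustration; the repository linked above remains the authoritative reference.

    # Per-task settings quoted in the setup, collected into plain dicts.
    import torch

    TASK_HEAD_LR = {
        "AVE": 5e-6,
        "AVVP": 3e-4,
        "AVS_S4": 3e-4,
        "AVS_MS3": 1.5e-4,
        "AVQA": 1e-4,
    }

    ADAPTER_CFG = {
        # AVE / AVVP: 32 latent tokens, downsampling factor 8
        "AVE": {"latent_tokens": 32, "downsample": 8},
        "AVVP": {"latent_tokens": 32, "downsample": 8},
        # AVS / AVQA: 2 latent tokens, downsampling factor 8, 4 group convolutions
        "AVS_S4": {"latent_tokens": 2, "downsample": 8, "groups": 4},
        "AVS_MS3": {"latent_tokens": 2, "downsample": 8, "groups": 4},
        "AVQA": {"latent_tokens": 2, "downsample": 8, "groups": 4},
    }

    def build_optimizer(adapter: torch.nn.Module,
                        prediction_head: torch.nn.Module,
                        task: str) -> torch.optim.Optimizer:
        # One parameter group per component: fixed adapter rate, task-specific head rate.
        return torch.optim.Adam([
            {"params": adapter.parameters(), "lr": 5e-4},
            {"params": prediction_head.parameters(), "lr": TASK_HEAD_LR[task]},
        ])

An optimizer built this way keeps the adapter rate fixed at 5e-4 and swaps only the prediction-head rate per task, matching the quoted configuration.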