Aligning Audio-Visual Joint Representations with an Agentic Workflow

Authors: Shentong Mo, Yibing Song

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments: In this section, we provide the detailed experimental setup and evaluation protocols used to assess the performance of our proposed method on various audio-visual representation learning tasks. These experiments are designed to validate the effectiveness of our approach, highlighting its advantages over existing state-of-the-art methods.
Researcher Affiliation | Collaboration | Shentong Mo, CMU / MBZUAI, DAMO Academy, Alibaba Group (shentongmo@gmail.com); Yibing Song, DAMO Academy, Alibaba Group, Hupan Laboratory (songyibing.syb@alibaba-inc.com)
Pseudocode | Yes | Algorithm 1: Algorithm for AVAgent
Open Source Code | Yes | Answer: [Yes] Justification: See the supplemental material.
Open Datasets | Yes | Datasets. We utilize several well-known datasets in the audio-visual domain... Specifically, we use a subset of 144k pairs in VGG-Sound [55] for pretraining, and fine-tune the model on the main audio-visual downstream datasets. 1) For source separation, we used 40,908 video clips from 49 music categories for training and 1,201 clips for testing, denoted as VGGSound-Music... Flickr-SoundNet [10]... VGGSound [55]... AudioSet [55]...
Dataset Splits | Yes | For audio-visual segmentation, AVSBench [58] includes 4,932 videos (10,852 frames in total) from 23 categories, including instruments, humans, and animals. Following prior work [58], we used the same split of 3,452/740/740 videos for train/val/test.
Hardware Specification | No | Our models are implemented using the state-of-the-art MAE framework [59] with specific optimizations to handle the large-scale data processing required for audio-visual tasks.
Software Dependencies | No | Our models are implemented using the state-of-the-art MAE framework [59] with specific optimizations to handle the large-scale data processing required for audio-visual tasks.
Experiment Setup | Yes | The input images are resized to a 224x224 resolution. The audio is represented by log spectrograms extracted from 3 s of audio at a sample rate of 8000 Hz. We follow prior work [64] and apply STFT to generate an input tensor of size 128x128 (128 frequency bands over 128 timesteps) using 50 ms windows with a hop size of 25 ms. The models were trained for 100 epochs using the Adam optimizer [65] with a learning rate of 1e-4 and a batch size of 128.
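
The audio front end quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes torchaudio is used (the paper does not name the library), assumes the 128 frequency bands come from a log-mel projection, and assumes the time axis is padded or cropped to exactly 128 frames; only the 8000 Hz sample rate, 3 s clip length, 50 ms window, 25 ms hop, and 128x128 output size are stated in the paper.

    # Hedged sketch of the described audio preprocessing (assumptions noted above).
    import torch
    import torch.nn.functional as F
    import torchaudio

    SAMPLE_RATE = 8000                        # 8000 Hz sample rate (from the paper)
    CLIP_SECONDS = 3                          # 3 s audio clips
    WIN_LENGTH = int(0.050 * SAMPLE_RATE)     # 50 ms window -> 400 samples
    HOP_LENGTH = int(0.025 * SAMPLE_RATE)     # 25 ms hop    -> 200 samples
    N_MELS = 128                              # 128 frequency bands (assumed mel bins)
    N_FRAMES = 128                            # 128 time steps

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE,
        n_fft=WIN_LENGTH,
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
    )

    def audio_to_logspec(waveform: torch.Tensor) -> torch.Tensor:
        """Map a mono waveform of shape (1, 24000) to a (1, 128, 128) log spectrogram."""
        spec = mel(waveform)                  # (1, 128, T)
        spec = torch.log(spec + 1e-6)         # log compression
        t = spec.shape[-1]
        if t < N_FRAMES:                      # pad or crop the time axis to 128 frames
            spec = F.pad(spec, (0, N_FRAMES - t))
        else:
            spec = spec[..., :N_FRAMES]
        return spec

    # Example: a random 3 s waveform stands in for a real audio clip.
    wav = torch.randn(1, SAMPLE_RATE * CLIP_SECONDS)
    print(audio_to_logspec(wav).shape)        # torch.Size([1, 128, 128])

The quoted training configuration (Adam, learning rate 1e-4, batch size 128, 100 epochs) would then map onto torch.optim.Adam(model.parameters(), lr=1e-4) with a DataLoader batch size of 128, assuming a standard PyTorch training loop.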