Aligning Audio-Visual Joint Representations with an Agentic Workflow
Authors: Shentong Mo, Yibing Song
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments: In this section, we provide the detailed experimental setup and evaluation protocols used to assess the performance of our proposed method on various audio-visual representation learning tasks. These experiments are designed to validate the effectiveness of our approach, highlighting its advantages over existing state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Shentong Mo (CMU / MBZUAI; DAMO Academy, Alibaba Group; shentongmo@gmail.com); Yibing Song (DAMO Academy, Alibaba Group; Hupan Laboratory; songyibing.syb@alibaba-inc.com) |
| Pseudocode | Yes | Algorithm 1: Algorithm for AVAgent |
| Open Source Code | Yes | Answer: [Yes] Justification: See the supplemental material. |
| Open Datasets | Yes | Datasets. We utilize several well-known datasets in the audio-visual domain... Specifically, we use a subset of 144k pairs in VGG-Sound [55] for pretraining, and fine-tune the model on the main audio-visual downstream datasets. 1) For source separation, we used 40,908 video clips from 49 music categories for training and 1,201 clips for testing, denoted as VGGSound-Music... Flickr-SoundNet [10]... VGGSound [55]... Audio Set [55]... |
| Dataset Splits | Yes | For audio-visual segmentation, AVSBench [58] includes 4,932 videos (in total 10,852 frames) from 23 categories, including instruments, humans, animals, etc. Following prior work [58], we used the same split of 3,452/740/740 videos for train/val/test. |
| Hardware Specification | No | Our models are implemented using the state-of-the-art MAE framework [59] with specific optimizations to handle the large-scale data processing required for audio-visual tasks. |
| Software Dependencies | No | Our models are implemented using the state-of-the-art MAE framework [59] with specific optimizations to handle the large-scale data processing required for audio-visual tasks. |
| Experiment Setup | Yes | The input images are resized into a 224x224 resolution. The audio is represented by log spectrograms extracted from 3s of audio at a sample rate of 8000Hz. We follow the prior work [64] and apply STFT to generate an input tensor of size 128x128 (128 frequency bands over 128 timesteps) using 50ms windows with a hop size of 25ms. The models were trained for 100 epochs using the Adam optimizer [65] with a learning rate of 1e-4 and a batch size of 128. (A preprocessing sketch follows the table.) |
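
The audio pipeline quoted above is concrete enough to reconstruct. Below is a minimal sketch of the spectrogram extraction, assuming librosa for the STFT; the FFT size (512 here) and the crop/pad policy used to reach exactly 128x128 are assumptions, since the excerpt only states the window length, hop size, sample rate, and target shape.

```python
import numpy as np
import librosa

# Parameters stated in the paper excerpt.
SAMPLE_RATE = 8000                        # 8 kHz audio
CLIP_SECONDS = 3                          # 3 s clips
WIN_LENGTH = int(0.050 * SAMPLE_RATE)     # 50 ms window -> 400 samples
HOP_LENGTH = int(0.025 * SAMPLE_RATE)     # 25 ms hop    -> 200 samples
N_FFT = 512                               # assumed; excerpt only gives 128 bands

def log_spectrogram(waveform: np.ndarray) -> np.ndarray:
    """Return a 128x128 log spectrogram (frequency bands x timesteps)."""
    spec = np.abs(librosa.stft(waveform, n_fft=N_FFT,
                               win_length=WIN_LENGTH, hop_length=HOP_LENGTH))
    log_spec = np.log(spec + 1e-6)
    # Crop/pad to 128 bands x 128 timesteps (assumed policy, not in the excerpt).
    log_spec = log_spec[:128, :128]
    pad_f = max(0, 128 - log_spec.shape[0])
    pad_t = max(0, 128 - log_spec.shape[1])
    return np.pad(log_spec, ((0, pad_f), (0, pad_t)))

# Example: a 3 s clip yields the stated 128x128 input tensor.
wav = np.zeros(SAMPLE_RATE * CLIP_SECONDS, dtype=np.float32)
assert log_spectrogram(wav).shape == (128, 128)
```

Note that with centered framing, 3 s at 8 kHz and a 25 ms hop produces 121 STFT frames, so some padding or cropping is needed to match the stated 128 timesteps; the excerpt does not specify which, hence the assumed policy above. The training configuration (Adam, learning rate 1e-4, batch size 128, 100 epochs) is standard and is not sketched here.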