Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Authors: Shentong Mo, Yibing Song
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments. In this section, we provide the detailed experimental setup and evaluation protocols used to assess the performance of our proposed method on various audio-visual representation learning tasks. These experiments are designed to validate the effectiveness of our approach, highlighting its advantages over existing state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Shentong Mo CMU / MBZUAI DAMO Academy, Alibaba Group EMAIL Yibing Song DAMO Academy, Alibaba Group Hupan Laboratory EMAIL |
| Pseudocode | Yes | Algorithm 1 Algorithm for AVAgent |
| Open Source Code | Yes | Answer: [Yes] Justification: See the supplemental material. |
| Open Datasets | Yes | Datasets. We utilize several well-known datasets in the audio-visual domain... Specifically, we use a subset of 144k pairs in VGG-Sound [55] for pretraining, and fine-tuning the model on audio-visual main downstream datasets. 1) For source separation, we used 40,908 video clips from 49 music categories for training and 1201 clips for testing, denoted as VGGSound-Music... Flickr-SoundNet [10]... VGGSound [55]... AudioSet [55]... |
| Dataset Splits | Yes | For audio-visual segmentation, AVSBench [58] includes 4,932 videos (in total 10,852 frames) from 23 categories, including instruments, humans, animals, etc. Following prior work [58], we used the same split of 3,452/740/740 videos for train/val/test. |
| Hardware Specification | No | Our models are implemented using state-of-the-art MAE framework [59] with specific optimizations to handle the large-scale data processing required for audio-visual tasks. |
| Software Dependencies | No | Our models are implemented using state-of-the-art MAE framework [59] with specific optimizations to handle the large-scale data processing required for audio-visual tasks. |
| Experiment Setup | Yes | The input images are resized into a 224x224 resolution. The audio is represented by log spectrograms extracted from 3s of audio at a sample rate of 8000Hz. We follow the prior work [64] and apply STFT to generate an input tensor of size 128x128 (128 frequency bands over 128 timesteps) using 50ms windows with a hop size of 25ms. The models were trained for 100 epochs using the Adam optimizer [65] with a learning rate of 1e-4 and a batch size of 128. |
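
The audio preprocessing quoted in the Experiment Setup row can be sketched as follows, as an aid for anyone attempting to reproduce it. This is a minimal NumPy illustration, not the authors' code. The sample rate, clip length, window, and hop come from the quoted setup; the Hann window, the absence of padding, and the linear-interpolation resize are assumptions. A plain STFT with these settings yields 201 frequency bins and 119 frames, so some additional step is needed to reach 128x128, and the quoted text does not say which one prior work [64] uses. The function names `log_spectrogram`, `resample_axis`, and `to_fixed_size` are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 8000                 # Hz, from the quoted setup
CLIP_SECONDS = 3                   # seconds of audio, from the quoted setup
WIN = SAMPLE_RATE * 50 // 1000     # 50 ms window -> 400 samples
HOP = SAMPLE_RATE * 25 // 1000     # 25 ms hop    -> 200 samples


def log_spectrogram(wave, win=WIN, hop=HOP, eps=1e-7):
    """Log-magnitude STFT. Hann window and no padding are assumptions."""
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hanning(win)
    frames = np.stack(
        [wave[i * hop:i * hop + win] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win // 2 + 1)
    return np.log(magnitude + eps).T                 # (freq_bins, n_frames)


def resample_axis(x, size, axis):
    """Linearly interpolate x to `size` points along one axis."""
    old = np.linspace(0.0, 1.0, x.shape[axis])
    new = np.linspace(0.0, 1.0, size)
    return np.apply_along_axis(lambda v: np.interp(new, old, v), axis, x)


def to_fixed_size(spec, size=128):
    """Resize to size x size. The resize method is an assumption."""
    return resample_axis(resample_axis(spec, size, axis=0), size, axis=1)


# Synthetic noise stands in for a real 3 s clip.
wave = np.random.default_rng(0).standard_normal(SAMPLE_RATE * CLIP_SECONDS)
spec = log_spectrogram(wave)
tensor = to_fixed_size(spec)
print(spec.shape, tensor.shape)
```

The gap between the raw STFT shape and the reported 128x128 tensor is itself a reproducibility detail: frequency-bin cropping, mel filtering, or a different padding scheme would each give a different input, so the resize above should be replaced once the procedure from [64] is confirmed.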