Contrastive Audio-Visual Masked Autoencoder

Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task.
Researcher Affiliation | Collaboration | Yuan Gong1 (yuangong@mit.edu), Andrew Rouditchenko1, Alexander H. Liu1, David Harwath2, Leonid Karlinsky3,4, Hilde Kuehne4,5, James Glass1; 1MIT CSAIL; 2UT Austin; 3IBM Research AI; 4MIT-IBM Watson AI Lab; 5Goethe University Frankfurt
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found.
Open Source Code | Yes | Code and pretrained models are at https://github.com/yuangongnd/cav-mae.
Open Datasets | Yes | We use two major audio-visual datasets for our experiments: AudioSet Gemmeke et al. (2017) and VGGSound Chen et al. (2020).
Dataset Splits | Yes | Due to changes in video availability, we downloaded 1,772,023 AudioSet-2M training, 18,691 AudioSet-20K training, and 17,249 evaluation samples, respectively. VGGSound Chen et al. (2020) is a collection of 200K 10-second YouTube video clips annotated with 309 classes. We download 183,727 training and 15,446 test samples.
Hardware Specification | Yes | Most of our experiments are run on 4 NVIDIA GTX Titan X Pascal GPUs with 12GB memory; only the scaled-up CAV-MAE-Scale+ is pretrained on 4 NVIDIA RTX A5000 GPUs with 24GB memory, making our result easier to reproduce with reasonable resources.
Software Dependencies | No | The paper mentions several models and frameworks (e.g., Transformer, AST, ViT, MAE) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | By default, all encoder Transformer layers are 768-dimensional and have 12 attention heads. The joint encoder of the vanilla AV-MAE is a 12-layer Transformer; the audio and visual encoders of CAV-MAE are 11-layer Transformers (each 768-dimensional) and the joint encoder is a single-layer Transformer. That is, we control the total number of encoder layers of all models at 12, but CAV and CAV-MAE are larger models due to the modality-specific encoders. The decoders of AV-MAE and CAV-MAE are 8-layer Transformers with an embedding dimension of 512 and 16 attention heads. These settings are identical to the original vision MAE He et al. (2022). We fix the contrastive loss temperature τ = 0.05. For CAV-MAE, we use λc = 0.01. Our training hyper-parameters are listed in Table 4.
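
The hyper-parameters quoted in the Experiment Setup row are enough to sketch how the contrastive and reconstruction objectives could be combined. Below is a minimal PyTorch-style sketch, not the authors' released implementation: the tensor names, the pooling assumption, and the symmetric InfoNCE form of the contrastive term are illustrative assumptions; only the temperature τ = 0.05 and the weight λc = 0.01 are taken from the paper.

    # Minimal sketch (not the released CAV-MAE code) of a contrastive
    # audio-visual loss with temperature tau = 0.05, combined with a masked
    # reconstruction loss weighted by lambda_c = 0.01, as quoted above.
    import torch
    import torch.nn.functional as F

    def contrastive_av_loss(audio_emb: torch.Tensor,
                            visual_emb: torch.Tensor,
                            tau: float = 0.05) -> torch.Tensor:
        """audio_emb, visual_emb: (batch, dim) pooled outputs of the
        modality-specific encoders; row i of each comes from the same clip."""
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(visual_emb, dim=-1)
        logits = a @ v.t() / tau                    # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # Paired audio/visual clips sit on the diagonal; other rows act as negatives.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def cav_mae_total_loss(reconstruction_loss: torch.Tensor,
                           audio_emb: torch.Tensor,
                           visual_emb: torch.Tensor,
                           lambda_c: float = 0.01) -> torch.Tensor:
        # Total objective: masked reconstruction term plus a small contrastive term.
        return reconstruction_loss + lambda_c * contrastive_av_loss(audio_emb, visual_emb)

The small weight on the contrastive term (λc = 0.01) reflects the paper's stated setting; details such as how embeddings are pooled and how negatives are handled should be taken from the released repository rather than this sketch.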