Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks.
Researcher Affiliation | Collaboration | ¹King Abdullah University of Science and Technology (KAUST), ²Facebook AI
Pseudocode | No | The paper includes diagrams to illustrate the framework (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks (an illustrative sketch of the cross-modal clustering step is given at the end of this section).
Open Source Code | Yes | All XDC pretrained models are publicly released on our project website.
Open Datasets | Yes | Pretraining datasets. We use four datasets: Kinetics [26], AudioSet [10], IG-Kinetics [12], and IG-Random, which have 240K, 2M, 65M, and 65M training videos, respectively. ... Downstream datasets. We evaluate our pretraining performance on three downstream benchmarks: UCF101 [56], HMDB51 [29], and ESC50 [48]
Dataset Splits | Yes | UCF101 and HMDB51 have 3 official train/test splits, while ESC50 has 5 splits. We conduct our ablation study (Subsection 4.2) using split-1 of each dataset. We also report our average performance over all splits when we compare with state-of-the-art methods in Section 6.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions software components like R(2+1)D and ResNet architectures but does not specify version numbers for any libraries, frameworks, or development environments used for implementation.
Experiment Setup | Yes | We use clips of L=8 frames for pretraining and finetuning our visual encoder Ev. We scale frames such that the smallest dimension is 256 pixels and then randomly crop images of size 224×224. We extract video clips at 30 fps and employ temporal jittering during training. For the audio input, we sample 2 seconds and use Q=40 MEL filters and P=100 audio frames. For inference on the downstream tasks, we uniformly sample 10 clips per testing example and average their predictions to make a video-level prediction (an input-pipeline sketch is given directly below the table).
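
To make the reported setup concrete, the following is a minimal sketch of the input pipeline and clip-averaged inference described in the Experiment Setup row. It assumes PyTorch/torchaudio, a 16 kHz audio sample rate, an n_fft of 1024, and already-decoded 30 fps frame tensors; none of these choices, nor the helper names, come from the paper.

```python
import torch
import torch.nn.functional as F
import torchaudio

L_FRAMES, CROP, SHORT_SIDE = 8, 224, 256           # 8-frame clips, 224x224 crops, 256 px short side
N_MELS, AUDIO_SECONDS, SAMPLE_RATE = 40, 2, 16000  # 40 mel bands over 2 s of audio (rate assumed)


def video_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) frames decoded at 30 fps, T >= 8. Temporal jitter, resize, random crop."""
    t0 = torch.randint(0, frames.shape[0] - L_FRAMES + 1, (1,)).item()  # temporal jittering
    clip = frames[t0:t0 + L_FRAMES].float()
    scale = SHORT_SIDE / min(clip.shape[-2:])                           # smallest dimension -> 256 px
    clip = F.interpolate(clip, scale_factor=scale, mode="bilinear", align_corners=False)
    h, w = clip.shape[-2:]
    y = torch.randint(0, h - CROP + 1, (1,)).item()
    x = torch.randint(0, w - CROP + 1, (1,)).item()
    return clip[..., y:y + CROP, x:x + CROP]                            # random 224x224 crop


def audio_input(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, SAMPLE_RATE * AUDIO_SECONDS) -> log-mel spectrogram of roughly 40 x 100."""
    hop = SAMPLE_RATE * AUDIO_SECONDS // 100                            # ~100 temporal frames
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=hop, n_mels=N_MELS)(waveform)
    return torch.log(mel + 1e-6)


def video_level_prediction(model, frames: torch.Tensor, n_clips: int = 10) -> torch.Tensor:
    """Downstream inference: average the predictions of 10 uniformly sampled clips."""
    starts = torch.linspace(0, frames.shape[0] - L_FRAMES, n_clips).long()
    # Spatial resizing/cropping is omitted here for brevity; apply it per clip in practice.
    clips = torch.stack([frames[int(s):int(s) + L_FRAMES].float() for s in starts])
    with torch.no_grad():
        return torch.softmax(model(clips), dim=-1).mean(dim=0)
```

Note that at 30 fps an 8-frame clip covers roughly 0.27 s of video, while the audio input spans 2 s, so the two modalities see different temporal extents of the same example.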
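
Since the paper conveys the cross-modal deep-clustering idea through figures rather than pseudocode (see the Pseudocode row), the sketch below illustrates, at toy scale, how one XDC-style round could look: each modality's classifier is trained on pseudo-labels obtained by k-means clustering the other modality's features. The linear encoders, cluster count K, optimizer settings, and tensor shapes are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 64  # number of clusters; an arbitrary placeholder (the paper ablates this choice)


def cluster_ids(features: torch.Tensor, k: int = K) -> torch.Tensor:
    """k-means over one modality's features; the assignments label the other modality."""
    return torch.as_tensor(KMeans(n_clusters=k, n_init=10).fit_predict(features.numpy()),
                           dtype=torch.long)


def train_step(encoder, head, inputs, labels, lr=1e-2):
    """Ordinary cross-entropy training, except the targets are cross-modal cluster IDs."""
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss = nn.CrossEntropyLoss()(head(encoder(inputs)), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()


# Toy stand-ins for the real R(2+1)D video encoder and ResNet audio encoder.
video_enc, audio_enc = nn.Linear(512, 256), nn.Linear(40 * 100, 256)
video_head, audio_head = nn.Linear(256, K), nn.Linear(256, K)
video_x = torch.randn(500, 512)        # placeholder pooled clip features
audio_x = torch.randn(500, 40 * 100)   # placeholder flattened log-mel spectrograms

for _ in range(3):  # alternate: cluster one modality, supervise the other
    with torch.no_grad():
        video_targets = cluster_ids(audio_enc(audio_x))  # audio clusters label the video model
        audio_targets = cluster_ids(video_enc(video_x))  # video clusters label the audio model
    train_step(video_enc, video_head, video_x, video_targets)
    train_step(audio_enc, audio_head, audio_x, audio_targets)
```

In the actual method the encoders are deep video and audio networks and clustering is re-run periodically over the full pretraining set; the alternation above only illustrates where the supervision signal comes from.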