Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks.
Researcher Affiliation | Collaboration | ¹King Abdullah University of Science and Technology (KAUST), ²Facebook AI
Pseudocode | No | The paper includes diagrams to illustrate the framework (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks (an illustrative sketch of the cross-modal clustering step is given at the end of this section).
Open Source Code | Yes | All XDC pretrained models are publicly released on our project website.
Open Datasets | Yes | Pretraining datasets. We use four datasets: Kinetics [26], AudioSet [10], IG-Kinetics [12], and IG-Random, which have 240K, 2M, 65M, and 65M training videos, respectively. ... Downstream datasets. We evaluate our pretraining performance on three downstream benchmarks: UCF101 [56], HMDB51 [29], and ESC50 [48]
Dataset Splits | Yes | UCF101 and HMDB51 have 3 official train/test splits, while ESC50 has 5 splits. We conduct our ablation study (Subsection 4.2) using split-1 of each dataset. We also report our average performance over all splits when we compare with state-of-the-art methods in Section 6.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions software components like R(2+1)D and ResNet architectures but does not specify version numbers for any libraries, frameworks, or development environments used for implementation.
Experiment Setup | Yes | We use clips of L=8 frames for pretraining and finetuning our visual encoder Ev. We scale frames such that the smallest dimension is 256 pixels and then randomly crop images of size 224×224. We extract video clips at 30 fps and employ temporal jittering during training. For the audio input, we sample 2 seconds and use Q=40 MEL filters and P=100 audio frames. For inference on the downstream tasks, we uniformly sample 10 clips per testing example and average their predictions to make a video-level prediction (an input-pipeline sketch is given directly below the table).
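
To make the reported setup concrete, the following is a minimal sketch of the input pipeline and clip-averaged inference described in the Experiment Setup row. It assumes PyTorch/torchaudio, a 16 kHz audio sample rate, an n_fft of 1024, and already-decoded 30 fps frame tensors; none of these choices, nor the helper names, come from the paper.

```python
import torch
import torch.nn.functional as F
import torchaudio

L_FRAMES, CROP, SHORT_SIDE = 8, 224, 256           # 8-frame clips, 224x224 crops, 256 px short side
N_MELS, AUDIO_SECONDS, SAMPLE_RATE = 40, 2, 16000  # 40 mel bands over 2 s of audio (rate assumed)


def video_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) frames decoded at 30 fps, T >= 8. Temporal jitter, resize, random crop."""
    t0 = torch.randint(0, frames.shape[0] - L_FRAMES + 1, (1,)).item()  # temporal jittering
    clip = frames[t0:t0 + L_FRAMES].float()
    scale = SHORT_SIDE / min(clip.shape[-2:])                           # smallest dimension -> 256 px
    clip = F.interpolate(clip, scale_factor=scale, mode="bilinear", align_corners=False)
    h, w = clip.shape[-2:]
    y = torch.randint(0, h - CROP + 1, (1,)).item()
    x = torch.randint(0, w - CROP + 1, (1,)).item()
    return clip[..., y:y + CROP, x:x + CROP]                            # random 224x224 crop


def audio_input(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, SAMPLE_RATE * AUDIO_SECONDS) -> log-mel spectrogram of roughly 40 x 100."""
    hop = SAMPLE_RATE * AUDIO_SECONDS // 100                            # ~100 temporal frames
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=hop, n_mels=N_MELS)(waveform)
    return torch.log(mel + 1e-6)


def video_level_prediction(model, frames: torch.Tensor, n_clips: int = 10) -> torch.Tensor:
    """Downstream inference: average the predictions of 10 uniformly sampled clips."""
    starts = torch.linspace(0, frames.shape[0] - L_FRAMES, n_clips).long()
    # Spatial resizing/cropping is omitted here for brevity; apply it per clip in practice.
    clips = torch.stack([frames[int(s):int(s) + L_FRAMES].float() for s in starts])
    with torch.no_grad():
        return torch.softmax(model(clips), dim=-1).mean(dim=0)
```

Note that at 30 fps an 8-frame clip covers roughly 0.27 s of video, while the audio input spans 2 s, so the two modalities see different temporal extents of the same example.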
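
Since the paper conveys the cross-modal deep-clustering idea through figures rather than pseudocode (see the Pseudocode row), the sketch below illustrates, at toy scale, how one XDC-style round could look: each modality's classifier is trained on pseudo-labels obtained by k-means clustering the other modality's features. The linear encoders, cluster count K, optimizer settings, and tensor shapes are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 64  # number of clusters; an arbitrary placeholder (the paper ablates this choice)


def cluster_ids(features: torch.Tensor, k: int = K) -> torch.Tensor:
    """k-means over one modality's features; the assignments label the other modality."""
    return torch.as_tensor(KMeans(n_clusters=k, n_init=10).fit_predict(features.numpy()),
                           dtype=torch.long)


def train_step(encoder, head, inputs, labels, lr=1e-2):
    """Ordinary cross-entropy training, except the targets are cross-modal cluster IDs."""
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss = nn.CrossEntropyLoss()(head(encoder(inputs)), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()


# Toy stand-ins for the real R(2+1)D video encoder and ResNet audio encoder.
video_enc, audio_enc = nn.Linear(512, 256), nn.Linear(40 * 100, 256)
video_head, audio_head = nn.Linear(256, K), nn.Linear(256, K)
video_x = torch.randn(500, 512)        # placeholder pooled clip features
audio_x = torch.randn(500, 40 * 100)   # placeholder flattened log-mel spectrograms

for _ in range(3):  # alternate: cluster one modality, supervise the other
    with torch.no_grad():
        video_targets = cluster_ids(audio_enc(audio_x))  # audio clusters label the video model
        audio_targets = cluster_ids(video_enc(video_x))  # video clusters label the audio model
    train_step(video_enc, video_head, video_x, video_targets)
    train_step(audio_enc, audio_head, audio_x, audio_targets)
```

In the actual method the encoders are deep video and audio networks and clustering is re-run periodically over the full pretraining set; the alternation above only illustrates where the supervision signal comes from.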