Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. |
| Researcher Affiliation | Collaboration | King Abdullah University of Science and Technology (KAUST); Facebook AI |
| Pseudocode | No | The paper includes diagrams to illustrate the framework (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All XDC pretrained models are publicly released on our project website. |
| Open Datasets | Yes | Pretraining datasets. We use four datasets: Kinetics [26], Audio Set [10], IG-Kinetics [12], and IG-Random, which have 240K, 2M, 65M, and 65M training videos, respectively. ... Downstream datasets. We evaluate our pretraining performance on three downstream benchmarks: UCF101 [56], HMDB51 [29], and ESC50 [48] |
| Dataset Splits | Yes | UCF101 and HMDB51 have 3 official train/test splits, while ESC50 has 5 splits. We conduct our ablation study (Subsection 4.2) using split-1 of each dataset. We also report our average performance over all splits when we compare with state-of-the-art methods in Section 6. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper mentions software components like R(2+1)D and ResNet architectures but does not specify version numbers for any libraries, frameworks, or development environments used for implementation. |
| Experiment Setup | Yes | We use clips of L=8 frames for pretraining and finetuning our visual encoder Ev. We scale frames such that the smallest dimension is 256 pixels and then random-crop images of size 224 × 224. We extract video clips at 30 fps and employ temporal jittering during training. For the audio input, we sample 2 seconds and use Q=40 MEL filters and P=100 audio frames. For inference on the downstream tasks, we uniformly sample 10 clips per testing example and average their predictions to make a video-level prediction. |
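
The preprocessing parameters quoted in the Experiment Setup row map directly onto a small data pipeline. The sketch below is a minimal, illustrative rendering of that description; it assumes PyTorch and torchaudio (the paper does not state which libraries were used), and all function names, tensor layouts, and STFT settings beyond Q=40 and P=100 are assumptions, not the authors' implementation.

```python
# Illustrative sketch of the quoted preprocessing, assuming PyTorch/torchaudio.
# Function names and tensor layouts are hypothetical; only the quoted constants
# (8 frames, 256-pixel rescale, 224x224 crop, 30 fps, 2 s audio, 40 mel filters,
# 100 audio frames) come from the paper.
import torch
import torchaudio

L_FRAMES = 8          # clip length in frames
SCALE_MIN = 256       # smallest spatial dimension after rescaling
CROP = 224            # random-crop size
FPS = 30              # frame rate used to extract clips
AUDIO_SECONDS = 2     # audio window length
N_MELS = 40           # Q = 40 MEL filters
N_AUDIO_FRAMES = 100  # P = 100 audio frames

def preprocess_video(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) video decoded at 30 fps; returns an (L, C, 224, 224) clip."""
    # temporal jittering: pick a random 8-frame window
    t0 = torch.randint(0, frames.shape[0] - L_FRAMES + 1, (1,)).item()
    clip = frames[t0:t0 + L_FRAMES].float() / 255.0
    # rescale so the smallest spatial dimension is 256 pixels
    _, _, h, w = clip.shape
    scale = SCALE_MIN / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    clip = torch.nn.functional.interpolate(
        clip, size=(new_h, new_w), mode="bilinear", align_corners=False)
    # random 224x224 spatial crop
    _, _, h, w = clip.shape
    y = torch.randint(0, h - CROP + 1, (1,)).item()
    x = torch.randint(0, w - CROP + 1, (1,)).item()
    return clip[:, :, y:y + CROP, x:x + CROP]

def preprocess_audio(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """waveform: (1, S) mono audio; returns a (N_MELS, N_AUDIO_FRAMES) log-mel spectrogram."""
    num_samples = AUDIO_SECONDS * sample_rate
    waveform = waveform[:, :num_samples]
    hop = num_samples // N_AUDIO_FRAMES  # hop chosen so ~100 frames result (assumption)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=N_MELS, hop_length=hop)(waveform)
    return torch.log(mel + 1e-6)[0, :, :N_AUDIO_FRAMES]
```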
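
The downstream inference protocol quoted in the same row (uniformly sample 10 clips per test video and average their predictions) can likewise be sketched as follows; `model`, the clip tensor layout, and softmax averaging are assumptions for illustration, not details confirmed by the paper.

```python
# Illustrative sketch of the quoted 10-clip video-level inference, assuming PyTorch.
import torch

NUM_TEST_CLIPS = 10  # quoted in the paper's experiment setup

@torch.no_grad()
def video_level_prediction(model, clips: torch.Tensor) -> torch.Tensor:
    """clips: (NUM_TEST_CLIPS, C, L, H, W) preprocessed clips sampled uniformly over the video."""
    logits = model(clips)                  # (NUM_TEST_CLIPS, num_classes)
    probs = torch.softmax(logits, dim=-1)  # per-clip class probabilities (assumed averaging scheme)
    return probs.mean(dim=0)               # average clip predictions -> video-level prediction
```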