Labelling unlabelled videos from scratch with multi-modal self-supervision

Authors: Yuki Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments are divided into three parts. First, in Section 4.1, we analyze the need for using both modalities when clustering and investigate the effect of our individual technical contributions via ablations and comparison to other approaches. Second, in Section 4.2, we demonstrate how our method achieves its stated goal of labelling a video dataset without human supervision. Third, in Section 4.3, we show that a side effect of our method is to learn effective audio-visual representations that can be used for downstream tasks, e.g. video action retrieval, establishing a new state of the art.
Researcher Affiliation | Collaboration | Yuki M. Asano (1), Mandela Patrick (1,2), Christian Rupprecht (1), Andrea Vedaldi (1,2); (1) Visual Geometry Group, University of Oxford, yuki@robots.ox.ac.uk; (2) Facebook AI Research, mandelapatrick@fb.com
Pseudocode | No | No pseudocode or algorithm block found. The paper describes algorithms verbally (e.g., 'the fast Sinkhorn-Knopp algorithm', 'greedy algorithm'). A minimal illustrative sketch of the Sinkhorn-Knopp assignment step is given after the table.
Open Source Code | No | Code will be made available at https://github.com/facebookresearch/selavi
Open Datasets | Yes | The first is the recently released VGG-Sound [17], which contains around 200k videos obtained in the wild from YouTube with low labelling noise and covering 309 categories of general classes. The second dataset is Kinetics-400 [41], which contains around 230k videos covering 400 human action categories. Third, we test our results on Kinetics-Sound, proposed in [3], formed by filtering the Kinetics dataset to 34 classes that are potentially manifested visually and audibly, leading to 22k videos. Lastly, we use the small-scale AVE dataset [68], originally proposed for audio-visual event localization and containing only around 4k videos.
Dataset Splits | No | No explicit training/validation/test dataset splits with percentages, counts, or specific methodologies (such as cross-validation or stratified splits) are provided. The paper mentions training on VGG-Sound and fine-tuning without supervision on the other datasets, but no specific split ratios are given for reproduction.
Hardware Specification | No | No specific hardware specifications (GPU/CPU models, memory, or cluster details) used for running the experiments are provided. Phrases like 'Our visual encoder is a R(2+1)D-18' describe the model, not the hardware.
Software Dependencies | No | The paper mentions using 'SGD' for optimization and the 'Faiss library [40]' for k-means. It also implicitly refers to 'PyTorch' through a GitHub link for a model. However, specific version numbers for these software components, which would be needed for reproducibility, are not provided. A generic example of a Faiss k-means call is given after the table.
Experiment Setup | Yes | Our visual encoder is a R(2+1)D-18 [70] network and our audio encoder is a ResNet [32] with 9 layers. For optimization, we use SGD for 200 epochs with weight decay of 10^-5 and momentum of 0.9; further implementation details are provided in Appendix A.2. A minimal sketch of this training setup is given after the table.
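
Since the paper contains no pseudocode, the snippet below is a minimal, illustrative sketch of the balanced cluster-assignment step that the "fast Sinkhorn-Knopp algorithm" is typically used for in self-labelling methods of this kind. The function name, the regularisation value lam, and the iteration count are assumptions for illustration, not taken from the paper or its released code.

```python
import numpy as np

def sinkhorn_knopp(log_probs, lam=0.05, n_iters=100):
    """Balanced soft assignment of N clips to K clusters.

    log_probs: (N, K) array of model log-probabilities.
    Returns an (N, K) matrix whose rows are (approximately) per-clip
    assignment distributions and whose columns carry roughly equal
    total mass, i.e. equally sized clusters.
    """
    N, K = log_probs.shape
    log_probs = log_probs - log_probs.max()     # numerical stability before exp
    Q = np.exp(log_probs / lam)                 # entropic optimal-transport kernel
    Q /= Q.sum()
    r = np.full(N, 1.0 / N)                     # target row marginals: one label per clip
    c = np.full(K, 1.0 / K)                     # target column marginals: equal-size clusters
    for _ in range(n_iters):
        Q *= (r / Q.sum(axis=1))[:, None]       # scale rows to match r
        Q *= (c / Q.sum(axis=0))[None, :]       # scale columns to match c
    return Q / Q.sum(axis=1, keepdims=True)

# Hard pseudo-labels are the per-row argmax:
# labels = sinkhorn_knopp(log_probs).argmax(axis=1)
```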
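
The paper names the Faiss library for k-means but gives no version. The snippet below is a generic example of how Faiss k-means is commonly invoked, with placeholder feature data and an assumed cluster count; it is not the authors' clustering code.

```python
import numpy as np
import faiss  # version unspecified in the paper

# Placeholder: L2-normalised clip embeddings of shape (num_clips, dim).
features = np.random.randn(10000, 512).astype("float32")

d = features.shape[1]
k = 309  # assumed target cluster count, e.g. the number of VGG-Sound classes

# Faiss exposes a simple k-means wrapper around its index structures.
kmeans = faiss.Kmeans(d, k, niter=20, verbose=False)
kmeans.train(features)

# Assign each clip to its nearest centroid.
_, assignments = kmeans.index.search(features, 1)
assignments = assignments.ravel()
```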
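
The quoted experiment setup (R(2+1)D-18 visual encoder, 9-layer ResNet audio encoder, SGD with momentum 0.9 and weight decay 10^-5 for 200 epochs) can be approximated in PyTorch as below. The learning rate and the ResNet-18 stand-in for the 9-layer audio ResNet are assumptions, since those details are deferred to Appendix A.2 and the released code.

```python
import torch
import torchvision

# Visual encoder: R(2+1)D-18, as stated in the paper (architecture available in torchvision).
visual_encoder = torchvision.models.video.r2plus1d_18()

# Audio encoder: the paper specifies a 9-layer ResNet; a standard ResNet-18 is used here
# purely as a stand-in, since the exact definition is given in their appendix/code.
audio_encoder = torchvision.models.resnet18()

params = list(visual_encoder.parameters()) + list(audio_encoder.parameters())

# SGD with momentum 0.9 and weight decay 1e-5, per the quoted setup;
# the learning rate 0.01 is an assumed placeholder (not given in the quoted text).
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-5)

num_epochs = 200
```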