Self-Supervised MultiModal Versatile Networks
Authors: Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To quantitatively evaluate our learned MultiModal Versatile (MMV) networks, we measure their performance on multiple downstream tasks, and in this way assess various properties of the representation of videos and images: verb learning (action classification on HMDB51, UCF101 and Kinetics600); noun learning (image classification on PASCAL VOC and ImageNet); joint text and visual representation (YouCook2, MSRVTT); and audio representation (sound classification on ESC-50 and AudioSet). The proposed MMV achieves state-of-the-art performance for self-supervised approaches on these benchmarks, and reduces the gap to the state-of-the-art performance for supervised approaches. |
| Researcher Affiliation | Collaboration | Jean-Baptiste Alayrac1, Adrià Recasens1, Rosalia Schneider1, Relja Arandjelović1, Jason Ramapuram2,3, Jeffrey De Fauw1, Lucas Smaira1, Sander Dieleman1, Andrew Zisserman1,4. 1DeepMind; 2Faculty of Science, Computer Science Dept., University of Geneva, HES-SO; 3Geneva School of Business Admin. (DMML Group); 4VGG, Dept. of Eng. Science, University of Oxford |
| Pseudocode | No | The paper describes methodologies using text and mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our models are publicly available at https://github.com/deepmind/deepmind-research/tree/master/mmv |
| Open Datasets | Yes | Training datasets. We use the HowTo100M [51] and/or the train split of AudioSet [22] datasets for self-supervised training. |
| Dataset Splits | Yes | Some classification datasets have official splits (3 for UCF101/HMDB51 and 5 for ESC-50). As per standard practice, split #1 serves as the validation set and is therefore used for ablations (Section 4.2), and the average accuracy over all splits is reported when comparing to the state-of-the-art (Section 4.3). |
| Hardware Specification | Yes | training TSM-50 takes 3 days on 32 Cloud TPUs. |
| Software Dependencies | No | The paper mentions software such as Adam, word2vec, and SpecAugment, but does not specify version numbers or list software dependencies with explicit versioning (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | Network architectures, hyperparameters and optimization. For video we explore using S3D-G [87] (dv = 1024), and TSM [44] with a ResNet50 backbone (dv = 2048) or a ResNet50x2 backbone (ResNet50 with all channels doubled [39], dv = 4096). We apply temporal and spatial average pooling at the last layer of the backbone (before the usual classification layer) to obtain a single vector fv(xv). During training, 32 frames (16 for the design exploration) are sampled at 10 fps and 200×200 crops are used (frames are resized so that the minimum side is 224). We use the following standard augmentations during training: random crop, horizontal flipping, temporal sampling and scale jittering, and color augmentation (details in the extended version [1]). Audio is represented as a log MEL spectrogram with 80 bins, processed with a ResNet50, and sampled in sync with the frames. Spatial pooling is applied to obtain fa(xa) of dimension da = 2048. For the final audio evaluation (Section 4.3), the network ingests 2 seconds of audio for fair comparison to [4, 41]; otherwise we use the same duration as the input video clip. Following [49], text is processed by removing stop words, retaining a maximum of (or padding to) 16 words, extracting 300-dimensional Google News pre-trained word2vec embeddings [52], and finally applying a linear layer to independently map the word inputs to 2048 dimensions, followed by a max pooling layer over the 16 words (dt = 2048). The dimension of the shared subspaces is 512, except for the Fine And Coarse (FAC) design where we use 512 dimensions for S_va (fine) and 256 for S_vat (coarse). More details about the architecture are provided in the extended version [1]. As done in [13], we normalize vectors prior to computing their dot products in the NCE and MIL-NCE losses and use a temperature of τ = 0.07 in the softmax, as in [29, 62, 86]. When training with all three modalities on HowTo100M, we observe that a larger weight on the Vision-Text loss is beneficial since text is more prominent. However, when training on HowTo100M+AudioSet, equal loss weights worked best because the audio from AudioSet is more informative. Therefore, a 10:1 loss weight ratio is used when training on HowTo100M and 1:1 for HowTo100M+AudioSet. Finally, all networks are trained from scratch using Adam [37] with an initial learning rate of 0.002, 5K steps of warm-up and a half-period cosine schedule [46]. |
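
The experiment-setup row pins down the two parts of the training recipe that are most often re-implemented: the contrastive objective (L2-normalized embeddings, dot-product similarities, softmax temperature τ = 0.07) and the optimization schedule (Adam, initial learning rate 0.002, 5K warm-up steps, half-period cosine decay). The minimal NumPy sketch below illustrates both under stated assumptions; it is not the authors' released code (see the repository linked under Open Source Code). In particular, the symmetric averaging of the two softmax directions and the total step count are placeholders chosen here for illustration, since the quote above does not state them.

```python
import numpy as np


def _log_softmax(logits):
    """Row-wise log-softmax, computed stably."""
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))


def nce_loss(video_emb, audio_emb, temperature=0.07):
    """Batch NCE loss between paired video/audio embeddings.

    Vectors are L2-normalized before the dot product, as stated in the paper;
    the i-th video/audio pair is the positive and all other in-batch pairings
    act as negatives. Averaging the video-to-audio and audio-to-video
    directions is an assumption of this sketch.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (B, B) similarity matrix
    loss_va = -np.diag(_log_softmax(logits)).mean()
    loss_av = -np.diag(_log_softmax(logits.T)).mean()
    return 0.5 * (loss_va + loss_av)


def learning_rate(step, base_lr=0.002, warmup_steps=5_000, total_steps=500_000):
    """Linear warm-up followed by a half-period cosine decay.

    base_lr and warmup_steps come from the paper; total_steps is a placeholder,
    as the overall training length is not given in the quoted setup.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))


# Tiny smoke test with random embeddings.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v, a = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
    print(f"NCE loss: {nce_loss(v, a):.3f}, lr@step 1000: {learning_rate(1000):.5f}")
```

The MIL-NCE loss used for the vision-text pair follows the same normalized, temperature-scaled pattern but aggregates the scores of several candidate narrations per clip before the softmax; the released repository is the authoritative reference for those details.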