Self-Supervised MultiModal Versatile Networks
Authors: Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To quantitatively evaluate our learned MultiModal Versatile (MMV) networks, we measure their performance on multiple downstream tasks, and in this way assess various properties of the representation of videos and images: verb learning (action classification on HMDB51, UCF101 and Kinetics600); noun learning (image classification on PASCAL VOC and ImageNet); joint text and visual representation (YouCook2, MSRVTT); and audio representation (sound classification on ESC-50 and AudioSet). The proposed MMV achieves state-of-the-art performance for self-supervised approaches on these benchmarks, and reduces the gap to the state-of-the-art performance for supervised approaches. |
| Researcher Affiliation | Collaboration | Jean-Baptiste Alayrac1, Adrià Recasens1, Rosalia Schneider1, Relja Arandjelović1, Jason Ramapuram2,3, Jeffrey De Fauw1, Lucas Smaira1, Sander Dieleman1, Andrew Zisserman1,4. 1DeepMind; 2Faculty of Science, Computer Science Dept., University of Geneva, HES-SO; 3Geneva School of Business Admin. (DMML Group); 4VGG, Dept. of Eng. Science, University of Oxford |
| Pseudocode | No | The paper describes methodologies using text and mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our models are publicly available at https://github.com/deepmind/deepmind-research/tree/master/mmv |
| Open Datasets | Yes | Training datasets. We use the HowTo100M [51] and/or the train split of AudioSet [22] datasets for self-supervised training. |
| Dataset Splits | Yes | Some classification datasets have official splits (3 for UCF101/HMDB51 and 5 for ESC-50). As per standard practice, split #1 serves as the validation set and is therefore used for ablations (Section 4.2), and the average accuracy over all splits is reported when comparing to the state-of-the-art (Section 4.3). |
| Hardware Specification | Yes | training TSM-50 takes 3 days on 32 Cloud TPUs. |
| Software Dependencies | No | The paper mentions software such as Adam, word2vec, and SpecAugment, but does not specify version numbers or list software dependencies with explicit versioning (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | Network architectures, hyperparameters and optimization. For video we explore using S3D-G [87] (dv = 1024), and TSM [44] with a ResNet50 backbone (dv = 2048) or a ResNet50x2 backbone (ResNet50 with all channels doubled [39], dv = 4096). We apply temporal and spatial average pooling at the last layer of the backbone (before the usual classification layer) to obtain a single vector fv(xv). During training, 32 frames (16 for the design exploration) are sampled at 10 fps and 200×200 crops are used (frames are resized so that the minimum side is 224). We use the following standard augmentations during training: random crop, horizontal flipping, temporal sampling and scale jittering, and color augmentation (details in the extended version [1]). Audio is represented as a log MEL spectrogram with 80 bins, processed with a ResNet50, and sampled in sync with the frames. Spatial pooling is applied to obtain fa(xa) of dimension da = 2048. For the final audio evaluation (Section 4.3), the network ingests 2 seconds of audio for fair comparison to [4, 41]; otherwise we use the same duration as the input video clip. Following [49], text is processed by removing stop words, retaining a maximum of (or padding to) 16 words, extracting 300-dimensional Google News pre-trained word2vec embeddings [52], and finally applying a linear layer to independently map the word inputs to 2048 dimensions, followed by a max pooling layer over the 16 words (dt = 2048). The dimension of the shared subspaces is 512, except for the Fine And Coarse (FAC) design where we use 512 dimensions for S_va (fine) and 256 for S_vat (coarse). More details about the architecture are provided in the extended version [1]. As done in [13], we normalize vectors prior to computing their dot products in the NCE and MIL-NCE losses and use a temperature of τ = 0.07 in the softmax, as in [29, 62, 86]. When training with all three modalities on HowTo100M, we observe that a larger weight on the Vision-Text loss is beneficial since text is more prominent. However, when training on HowTo100M+AudioSet, equal loss weights worked best because the audio from AudioSet is more informative. Therefore, a 10:1 loss weight ratio is used when training on HowTo100M and 1:1 for HowTo100M+AudioSet. Finally, all networks are trained from scratch using Adam [37] with an initial learning rate of 0.002, 5K steps of warm-up and a half-period cosine schedule [46]. |
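
The experiment-setup row pins down the two parts of the training recipe that are most often re-implemented: the contrastive objective (L2-normalized embeddings, dot-product similarities, softmax temperature τ = 0.07) and the optimization schedule (Adam, initial learning rate 0.002, 5K warm-up steps, half-period cosine decay). The minimal NumPy sketch below illustrates both under stated assumptions; it is not the authors' released code (see the repository linked under Open Source Code). In particular, the symmetric averaging of the two softmax directions and the total step count are placeholders chosen here for illustration, since the quote above does not state them.

```python
import numpy as np


def _log_softmax(logits):
    """Row-wise log-softmax, computed stably."""
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))


def nce_loss(video_emb, audio_emb, temperature=0.07):
    """Batch NCE loss between paired video/audio embeddings.

    Vectors are L2-normalized before the dot product, as stated in the paper;
    the i-th video/audio pair is the positive and all other in-batch pairings
    act as negatives. Averaging the video-to-audio and audio-to-video
    directions is an assumption of this sketch.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (B, B) similarity matrix
    loss_va = -np.diag(_log_softmax(logits)).mean()
    loss_av = -np.diag(_log_softmax(logits.T)).mean()
    return 0.5 * (loss_va + loss_av)


def learning_rate(step, base_lr=0.002, warmup_steps=5_000, total_steps=500_000):
    """Linear warm-up followed by a half-period cosine decay.

    base_lr and warmup_steps come from the paper; total_steps is a placeholder,
    as the overall training length is not given in the quoted setup.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))


# Tiny smoke test with random embeddings.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v, a = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
    print(f"NCE loss: {nce_loss(v, a):.3f}, lr@step 1000: {learning_rate(1000):.5f}")
```

The MIL-NCE loss used for the vision-text pair follows the same normalized, temperature-scaled pattern but aggregates the scores of several candidate narrations per clip before the softmax; the released repository is the authoritative reference for those details.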