Support-set bottlenecks for video-text representation learning

Authors: Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander G. Hauptmann, João F. Henriques, Andrea Vedaldi

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval. "4 EXPERIMENTS: We validate empirically the ability of our method to learn better representations for the downstream tasks of text-to-video and video-to-text retrieval. First, in sec. 4.2 we ablate various model components on the MSR-VTT dataset. Then, in sec. 4.3 we show that our best model significantly outperforms state-of-the-art retrieval systems on three datasets: MSR-VTT, ActivityNet and VATEX." (See the Recall@K sketch after this table for how such retrieval results are typically scored.)
Researcher Affiliation | Collaboration | Mandela Patrick, Po-Yao Huang, Florian Metze & Andrea Vedaldi, Facebook AI ({mandelapatrick,berniehuang,fmetze,vedaldi}@fb.com); Alexander Hauptmann, Language Technologies Institute, Carnegie Mellon University (alex@cs.cmu.edu); Yuki M. Asano & João Henriques, Visual Geometry Group, University of Oxford ({yuki,joao}@robots.ox.ac.uk)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "Datasets. HowTo100M (Miech et al., 2019) is a large-scale instructional video collection... MSR-VTT (Xu et al., 2016) contains 10,000 videos... VATEX (Wang et al., 2019) is a multilingual... ActivityNet Captions (Krishna et al., 2017) dataset consists... MSVD (Chen & Dolan, 2011) dataset consists..."
Dataset Splits | Yes | "MSR-VTT (Xu et al., 2016) contains 10,000 videos... We report results on the 1k-A split (9,000 training, 1,000 testing) as in Liu et al. (2019)."; "VATEX (Wang et al., 2019)... We use the official training split with 25,991 videos and report on the validation split as in HGR (Chen et al., 2020b)."; "ActivityNet Captions (Krishna et al., 2017)... We use the 10K training split to train from scratch/finetune the model and report the performance on the 5K val1 split."; "MSVD (Chen & Dolan, 2011)... We use the standard split of 1,200, 100, and 670 videos for training, validation, and testing (Liu et al., 2019; Venugopalan et al., 2015b; Xu et al., 2015)." (These splits are summarized as a config mapping after the table.)
Hardware Specification | Yes | "Pre-training on 1.2 million HowTo100M videos takes around 160 GPU hours (NVIDIA V100) for 20 epochs."
Software Dependencies | No | The paper mentions specific models (T5-base, ResNet152, R(2+1)D-34) and an optimizer (Adam), but does not provide version numbers for programming languages or core libraries such as PyTorch or TensorFlow, which are essential for full reproducibility.
Experiment Setup | Yes | "The margin α of the max-margin loss is 0.2, and the temperature T is set to 0.1... We use the Adam (Kingma & Ba, 2015) optimizer with an initial learning rate of 5×10⁻⁵ and clip gradients greater than 0.2 during the training phase. Dropout rate is 0.3 for all datasets besides ActivityNet (0.0). When training on MSR-VTT, ActivityNet and VATEX, batch size is set to 64. For MSR-VTT training, we sample and truncate videos to 32 seconds, text to 100 tokens and train for 20 epochs." (See the training-setup sketch after this table.)
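
For context on the retrieval results cited in the Research Type row: text-to-video and video-to-text retrieval on these benchmarks is conventionally scored with Recall@K. Below is a minimal sketch of that metric; the function name and the assumption of diagonal ground truth (the i-th caption matching the i-th video) are illustrative conventions, not code from the paper.

    import numpy as np

    def recall_at_k(sim, ks=(1, 5, 10)):
        """Recall@K from a (num_queries x num_items) similarity matrix,
        assuming query i's correct item is item i (diagonal ground truth)."""
        sim = np.asarray(sim)
        ranks = np.empty(sim.shape[0], dtype=int)
        for i, row in enumerate(sim):
            order = np.argsort(-row)               # item indices, best match first
            ranks[i] = np.where(order == i)[0][0]  # 0-based rank of the correct item
        return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

    # Text-to-video retrieval scores captions against videos; video-to-text
    # retrieval simply uses the transposed similarity matrix:
    #   recall_at_k(sim)      # text -> video
    #   recall_at_k(sim.T)    # video -> text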
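The split sizes quoted in the Dataset Splits row can be collected into a small configuration mapping. This is a hypothetical convenience structure for bookkeeping, not something taken from the paper.

    # Hypothetical summary of the splits quoted in the Dataset Splits row.
    DATASET_SPLITS = {
        "MSR-VTT (1k-A)":       {"train": 9_000,  "test": 1_000},
        "VATEX":                {"train": 25_991, "eval": "official validation split"},
        "ActivityNet Captions": {"train": 10_000, "eval": "val1 (5K videos)"},
        "MSVD":                 {"train": 1_200,  "val": 100, "test": 670},
    }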
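The hyperparameters quoted in the Experiment Setup row map onto a short PyTorch sketch. The bidirectional max-margin ranking loss shown here is one standard formulation consistent with the stated margin α = 0.2; the paper's exact loss variant, the model, and the training-step wiring are assumptions, and interpreting "clip gradients greater than 0.2" as norm clipping is likewise an assumption.

    import torch

    # Hyperparameters as quoted in the Experiment Setup row.
    ALPHA = 0.2        # margin of the max-margin loss
    TEMPERATURE = 0.1  # softmax temperature T (used by the contrastive objective)
    LR = 5e-5          # initial learning rate for Adam
    GRAD_CLIP = 0.2    # gradient clipping threshold
    DROPOUT = 0.3      # 0.0 for ActivityNet
    BATCH_SIZE = 64    # MSR-VTT / ActivityNet / VATEX

    def max_margin_loss(sim: torch.Tensor, margin: float = ALPHA) -> torch.Tensor:
        """Bidirectional max-margin ranking loss over a (B x B) similarity
        matrix whose diagonal holds the matching video-text pairs.
        One standard formulation; the paper's exact variant may differ."""
        pos = sim.diag().view(-1, 1)
        cost_t2v = (margin + sim - pos).clamp(min=0)      # text -> video violations
        cost_v2t = (margin + sim - pos.t()).clamp(min=0)  # video -> text violations
        off_diag = 1.0 - torch.eye(sim.size(0), device=sim.device)
        return ((cost_t2v + cost_v2t) * off_diag).mean()

    # Placeholder training step (model, video_batch, text_batch are hypothetical):
    # optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    # sim = model(video_batch, text_batch)          # (B x B) similarity matrix
    # loss = max_margin_loss(sim)
    # loss.backward()
    # torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    # optimizer.step()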