COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Authors: Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on the datasets ActivityNet-captions [39] and YouCook2 [40]. We measure the performance on the retrieval task with standard retrieval metrics, i.e., recall at K (R@K, e.g. R@1, R@5, R@10) and Median Rank (MR). Influence of each component: we show results of a model ablation study in Table 1. Comparison to the state of the art: Table 2 summarizes the results of the paragraph-to-video and video-to-paragraph retrieval tasks on the ActivityNet-captions dataset. (A sketch of the R@K/MR computation is given below the table.)
Researcher Affiliation | Collaboration | 1 University of Freiburg, 2 University of Maryland, Baltimore County; 1 {gings, zolfagha, brox}@cs.uni-freiburg.de, 2 hpirsiav@umbc.edu
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. It describes the model architecture and components in text and with diagrams (Figure 1 and Figure 2).
Open Source Code | Yes | All code is available open-source at https://github.com/gingsi/coot-videotext
Open Datasets | Yes | Datasets. We evaluate our method on the datasets ActivityNet-captions [39] and YouCook2 [40]. ActivityNet-captions consists of 20k YouTube videos with an average length of 2 minutes, with 72k clip-sentence pairs. There are 10k, 5k and 5k videos in train, val1 and val2, respectively. YouCook2 contains 2000 videos with a total of 14k clips. This dataset is collected from YouTube and covers 89 types of recipes. There are 9.6k clips for training and 3.2k clips for validation. For each clip there is a manually annotated textual description.
Dataset Splits | Yes | ActivityNet-captions consists of 20k YouTube videos with an average length of 2 minutes, with 72k clip-sentence pairs. There are 10k, 5k and 5k videos in train, val1 and val2, respectively. YouCook2 contains 2000 videos with a total of 14k clips. This dataset is collected from YouTube and covers 89 types of recipes. There are 9.6k clips for training and 3.2k clips for validation. We use a mini-batch size of 64 video/paragraph pairs and sample all corresponding clips and sentences. (A batch-sampling sketch is given below the table.)
Hardware Specification | Yes | Training is fast and takes less than 3 hours on two GTX 1080 Ti GPUs (without data I/O). We thank Ehsan Adeli for helpful comments, Antoine Miech for providing details on their retrieval evaluation, and Facebook for providing us a GPU server with Tesla P100 processors for this research work.
Software Dependencies | No | The paper mentions 'BERT-Base, Uncased', 'ResNet-152', 'ResNeXt-101', and 'GELU [34]', but does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | Similar to [21], we set all margins α = αg = β = γ = µ = 0.2. We use a mini-batch size of 64 video/paragraph pairs and sample all corresponding clips and sentences. To apply the cycle-consistency loss, we found that sampling 1 clip per video and 1 sentence per paragraph works best. The optimal loss weight λ depends on architecture and dataset. As activation function, we found GELU [34] to perform best. We set the hidden size to 384 and use a pointwise linear layer to reduce the input feature dimension. We use one self-attention layer for the T-Transformer and one self-attention and one cross-attention layer for CoT. For further details on optimization and hyperparameters, we refer the interested reader to the supplementary material. (These hyperparameters are collected in a sketch below the table.)
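
The retrieval metrics quoted in the Research Type row (R@K and Median Rank) are straightforward to compute from a pairwise similarity matrix. Below is a minimal NumPy sketch, not the authors' evaluation code, assuming a square similarity matrix in which the ground-truth match for query i is target i.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """R@K and Median Rank from a (num_queries, num_targets) similarity
    matrix in which target i is the ground-truth match for query i."""
    order = np.argsort(-sim, axis=1)               # targets sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)         # 0-based rank of the ground-truth target
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}
    metrics["MR"] = float(np.median(ranks)) + 1.0  # 1-based median rank
    return metrics

# Toy usage with random scores; real scores would be video/text embedding similarities.
sim = np.random.randn(100, 100)
print(retrieval_metrics(sim))
```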
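
The Dataset Splits row mentions sampling 64 video/paragraph pairs per mini-batch together with all of their clips and sentences. The sketch below illustrates one plausible way to do that; the in-memory index and all names are hypothetical and do not reflect the loaders in the authors' repository.

```python
import random

# Hypothetical in-memory index: video_id -> list of (clip_feature_file, sentence) pairs.
# Real ActivityNet-captions / YouCook2 loaders differ; this only mirrors the sampling idea.
video_index = {
    f"video_{i:05d}": [(f"video_{i:05d}_clip{j}.npy", f"sentence {j}")
                       for j in range(random.randint(3, 8))]
    for i in range(1000)
}

def sample_batch(index, batch_size=64, seed=None):
    """Draw a mini-batch of video/paragraph pairs and gather all of their
    clip-sentence pairs, as described in the quoted setup."""
    rng = random.Random(seed)
    video_ids = rng.sample(list(index), batch_size)
    clips, sentences, clip_to_video = [], [], []
    for b, vid in enumerate(video_ids):
        for clip_file, sentence in index[vid]:
            clips.append(clip_file)
            sentences.append(sentence)
            clip_to_video.append(b)  # which video/paragraph each clip/sentence belongs to
    return video_ids, clips, sentences, clip_to_video

video_ids, clips, sentences, clip_to_video = sample_batch(video_index, batch_size=64, seed=0)
print(len(video_ids), len(clips), len(sentences))
```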
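
The hyperparameters quoted in the Experiment Setup row can be summarized as a small configuration object. Field names below are illustrative rather than the actual config keys of the gingsi/coot-videotext repository, and the loss shown is a generic bidirectional max-margin ranking loss that such margins typically parameterize, not necessarily COOT's exact objective.

```python
from dataclasses import dataclass

import torch


@dataclass
class CootSetup:
    """Hyperparameters quoted in the paper; field names are illustrative."""
    margin: float = 0.2               # alpha = alpha_g = beta = gamma = mu = 0.2
    batch_size: int = 64              # video/paragraph pairs per mini-batch
    cycle_clips_per_video: int = 1    # sampling for the cycle-consistency loss
    cycle_sentences_per_paragraph: int = 1
    loss_weight_lambda: float = 1.0   # paper: optimal value depends on architecture/dataset
    hidden_size: int = 384
    activation: str = "gelu"
    t_transformer_self_attention_layers: int = 1
    cot_self_attention_layers: int = 1
    cot_cross_attention_layers: int = 1


def max_margin_loss(video_emb, text_emb, margin=0.2):
    """Generic bidirectional max-margin ranking loss with the quoted margin.
    A simplified stand-in, not necessarily the paper's exact formulation."""
    sim = video_emb @ text_emb.t()                    # (B, B); cosine similarity if inputs are L2-normalized
    pos = sim.diag().unsqueeze(1)                     # matching pairs sit on the diagonal
    cost_v2t = (margin + sim - pos).clamp(min=0)      # video -> text: penalize negative texts
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)  # text -> video: penalize negative videos
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_v2t.masked_fill(mask, 0).mean() + cost_t2v.masked_fill(mask, 0).mean()


cfg = CootSetup()
v = torch.nn.functional.normalize(torch.randn(cfg.batch_size, cfg.hidden_size), dim=1)
t = torch.nn.functional.normalize(torch.randn(cfg.batch_size, cfg.hidden_size), dim=1)
print(max_margin_loss(v, t, margin=cfg.margin).item())
```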