COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Authors: Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on the datasets ActivityNet Captions [39] and YouCook2 [40]. We measure the performance on the retrieval task with standard retrieval metrics, i.e., recall at K (R@K, e.g. R@1, R@5, R@10) and Median Rank (MR). Influence of each component: We show results of a model ablation study in Table 1. Comparison to the state of the art: Table 2 summarizes the results of the paragraph-to-video and video-to-paragraph retrieval tasks on the ActivityNet Captions dataset. (A sketch of the R@K and Median Rank metrics follows the table.) |
| Researcher Affiliation | Collaboration | ¹University of Freiburg, ²University of Maryland, Baltimore County. ¹{gings, zolfagha, brox}@cs.uni-freiburg.de, ²hpirsiav@umbc.edu |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. It describes the model architecture and components in text and with diagrams (Figure 1 and Figure 2). |
| Open Source Code | Yes | All code is available open-source at https://github.com/gingsi/coot-videotext |
| Open Datasets | Yes | Datasets. We evaluate our method on the datasets ActivityNet Captions [39] and YouCook2 [40]. ActivityNet Captions consists of 20k YouTube videos with an average length of 2 minutes, with 72k clip-sentence pairs. There are 10k, 5k and 5k videos in train, val1 and val2, respectively. YouCook2 contains 2000 videos with a total number of 14k clips. This dataset is collected from YouTube and covers 89 types of recipes. There are 9.6k clips for training and 3.2k clips for validation. For each clip there is a manually annotated textual description. |
| Dataset Splits | Yes | ActivityNet Captions consists of 20k YouTube videos with an average length of 2 minutes, with 72k clip-sentence pairs. There are 10k, 5k and 5k videos in train, val1 and val2, respectively. YouCook2 contains 2000 videos with a total number of 14k clips. This dataset is collected from YouTube and covers 89 types of recipes. There are 9.6k clips for training and 3.2k clips for validation. We use a mini-batch size of 64 video/paragraph pairs and sample all corresponding clips and sentences. |
| Hardware Specification | Yes | Training is fast and takes less than 3 hours on two GTX1080Ti GPUs (without data I/O). We thank Ehsan Adeli for helpful comments, Antoine Miech for providing details on their retrieval evaluation, and Facebook for providing us a GPU server with Tesla P100 processors for this research work. |
| Software Dependencies | No | The paper mentions 'BERT-Base, Uncased', 'ResNet-152', 'ResNeXt-101', and 'GELU [34]' but does not provide version numbers for these software components or libraries. |
| Experiment Setup | Yes | Similar to [21] we set all margins α = αg = β = γ = µ = 0.2. We use a mini-batch size of 64 video/paragraph pairs and sample all corresponding clips and sentences. To apply the cycle-consistency loss, we found that sampling 1 clip per video and 1 sentence per paragraph works best. The optimal loss weight λ depends on architecture and dataset. As activation function, we found GELU [34] to perform best. We set the hidden size to 384 and use a pointwise linear layer to reduce the input feature dimension. We use one self-attention layer for the T-Transformer and one self-attention and one cross-attention layer for CoT. For further details on optimization and hyperparameters we refer the interested reader to the supplementary material. (Hedged sketches of these hyperparameters and of a margin-0.2 ranking loss follow the table.) |
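
The retrieval metrics quoted in the Research Type row, recall at K (R@K) and Median Rank (MR), can be computed from a query-gallery similarity matrix. Below is a minimal sketch assuming the ground-truth match for query i is gallery item i; the function name `retrieval_metrics` is my own and not taken from the authors' repository.

```python
import numpy as np

def retrieval_metrics(similarity: np.ndarray, ks=(1, 5, 10)):
    """R@K and Median Rank for a [num_queries x num_gallery] similarity matrix
    whose ground-truth match for query i is gallery item i."""
    # Rank of the ground-truth item for each query (0 = retrieved first).
    order = np.argsort(-similarity, axis=1)  # indices sorted by descending similarity
    gt_rank = np.argmax(order == np.arange(len(similarity))[:, None], axis=1)

    metrics = {f"R@{k}": 100.0 * float(np.mean(gt_rank < k)) for k in ks}
    metrics["MR"] = float(np.median(gt_rank + 1))  # median rank, 1-indexed
    return metrics

# Toy example: 5 queries scored against 5 gallery items.
rng = np.random.default_rng(0)
print(retrieval_metrics(rng.normal(size=(5, 5))))
```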
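
The Experiment Setup row lists the main hyperparameters in prose. The dataclass below simply collects those quoted values in one place; the field names are my own shorthand, not keys from the authors' configuration files.

```python
from dataclasses import dataclass

@dataclass
class CootSetupSketch:
    hidden_size: int = 384                      # "We set the hidden size to 384"
    input_projection: str = "pointwise linear"  # reduces the input feature dimension
    activation: str = "gelu"                    # GELU reported to perform best
    t_transformer_self_attn_layers: int = 1     # one self-attention layer for the T-Transformer
    cot_self_attn_layers: int = 1               # one self-attention layer for CoT
    cot_cross_attn_layers: int = 1              # one cross-attention layer for CoT
    batch_size: int = 64                        # 64 video/paragraph pairs per mini-batch
    margin: float = 0.2                         # α = αg = β = γ = µ = 0.2

print(CootSetupSketch())
```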
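
The shared margin of 0.2 points to max-margin ranking terms between video and text embeddings. The sketch below is one hedged instance of such a bidirectional hinge loss; `max_margin_loss` is an illustrative name, and the authors' full objective additionally includes cycle-consistency and other alignment terms not shown here.

```python
import torch
import torch.nn.functional as F

def max_margin_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, margin: float = 0.2):
    """Bidirectional max-margin ranking loss over a batch of matching video/text pairs."""
    # Cosine-similarity matrix; the diagonal holds the positive (matching) pairs.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t()
    pos = sim.diag().unsqueeze(1)

    # Hinge cost in both retrieval directions (video->text and text->video);
    # the diagonal is zeroed so positives are not counted as negatives.
    cost_v2t = (margin + sim - pos).clamp(min=0)
    cost_t2v = (margin + sim.t() - pos).clamp(min=0)
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_v2t.masked_fill(eye, 0).mean() + cost_t2v.masked_fill(eye, 0).mean()

# Toy example matching the quoted setup: batch of 64 pairs, embedding size 384.
loss = max_margin_loss(torch.randn(64, 384), torch.randn(64, 384))
print(loss.item())
```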