COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Authors: Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on the datasets ActivityNet-captions [39] and YouCook2 [40]. We measure the performance on the retrieval task with standard retrieval metrics, i.e., recall at K (R@K, e.g. R@1, R@5, R@10) and Median Rank (MR). Influence of each component: we show results of a model ablation study in Table 1. Comparison to the state of the art: Table 2 summarizes the results of the paragraph-to-video and video-to-paragraph retrieval tasks on the ActivityNet-captions dataset. (A sketch of the R@K/MR computation is given below the table.)
Researcher Affiliation | Collaboration | 1 University of Freiburg, 2 University of Maryland, Baltimore County; 1 {gings, zolfagha, brox}@cs.uni-freiburg.de, 2 hpirsiav@umbc.edu
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. It describes the model architecture and components in text and with diagrams (Figure 1 and Figure 2).
Open Source Code | Yes | All code is available open-source at https://github.com/gingsi/coot-videotext
Open Datasets | Yes | Datasets. We evaluate our method on the datasets ActivityNet-captions [39] and YouCook2 [40]. ActivityNet-captions consists of 20k YouTube videos with an average length of 2 minutes, with 72k clip-sentence pairs. There are 10k, 5k and 5k videos in train, val1 and val2, respectively. YouCook2 contains 2000 videos with a total of 14k clips. This dataset is collected from YouTube and covers 89 types of recipes. There are 9.6k clips for training and 3.2k clips for validation. For each clip there is a manually annotated textual description.
Dataset Splits | Yes | ActivityNet-captions consists of 20k YouTube videos with an average length of 2 minutes, with 72k clip-sentence pairs. There are 10k, 5k and 5k videos in train, val1 and val2, respectively. YouCook2 contains 2000 videos with a total of 14k clips. This dataset is collected from YouTube and covers 89 types of recipes. There are 9.6k clips for training and 3.2k clips for validation. We use a mini-batch size of 64 video/paragraph pairs and sample all corresponding clips and sentences. (A batch-sampling sketch is given below the table.)
Hardware Specification | Yes | Training is fast and takes less than 3 hours on two GTX 1080 Ti GPUs (without data I/O). We thank Ehsan Adeli for helpful comments, Antoine Miech for providing details on their retrieval evaluation, and Facebook for providing us a GPU server with Tesla P100 processors for this research work.
Software Dependencies | No | The paper mentions 'BERT-Base, Uncased', 'ResNet-152', 'ResNeXt-101', and 'GELU [34]', but does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | Similar to [21], we set all margins α = αg = β = γ = µ = 0.2. We use a mini-batch size of 64 video/paragraph pairs and sample all corresponding clips and sentences. To apply the cycle-consistency loss, we found that sampling 1 clip per video and 1 sentence per paragraph works best. The optimal loss weight λ depends on architecture and dataset. As activation function, we found GELU [34] to perform best. We set the hidden size to 384 and use a pointwise linear layer to reduce the input feature dimension. We use one self-attention layer for the T-Transformer and one self-attention and one cross-attention layer for CoT. For further details on optimization and hyperparameters, we refer the interested reader to the supplementary material. (These hyperparameters are collected in a sketch below the table.)
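
The retrieval metrics quoted in the Research Type row (R@K and Median Rank) are straightforward to compute from a pairwise similarity matrix. Below is a minimal NumPy sketch, not the authors' evaluation code, assuming a square similarity matrix in which the ground-truth match for query i is target i.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """R@K and Median Rank from a (num_queries, num_targets) similarity
    matrix in which target i is the ground-truth match for query i."""
    order = np.argsort(-sim, axis=1)               # targets sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)         # 0-based rank of the ground-truth target
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}
    metrics["MR"] = float(np.median(ranks)) + 1.0  # 1-based median rank
    return metrics

# Toy usage with random scores; real scores would be video/text embedding similarities.
sim = np.random.randn(100, 100)
print(retrieval_metrics(sim))
```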
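
The Dataset Splits row mentions sampling 64 video/paragraph pairs per mini-batch together with all of their clips and sentences. The sketch below illustrates one plausible way to do that; the in-memory index and all names are hypothetical and do not reflect the loaders in the authors' repository.

```python
import random

# Hypothetical in-memory index: video_id -> list of (clip_feature_file, sentence) pairs.
# Real ActivityNet-captions / YouCook2 loaders differ; this only mirrors the sampling idea.
video_index = {
    f"video_{i:05d}": [(f"video_{i:05d}_clip{j}.npy", f"sentence {j}")
                       for j in range(random.randint(3, 8))]
    for i in range(1000)
}

def sample_batch(index, batch_size=64, seed=None):
    """Draw a mini-batch of video/paragraph pairs and gather all of their
    clip-sentence pairs, as described in the quoted setup."""
    rng = random.Random(seed)
    video_ids = rng.sample(list(index), batch_size)
    clips, sentences, clip_to_video = [], [], []
    for b, vid in enumerate(video_ids):
        for clip_file, sentence in index[vid]:
            clips.append(clip_file)
            sentences.append(sentence)
            clip_to_video.append(b)  # which video/paragraph each clip/sentence belongs to
    return video_ids, clips, sentences, clip_to_video

video_ids, clips, sentences, clip_to_video = sample_batch(video_index, batch_size=64, seed=0)
print(len(video_ids), len(clips), len(sentences))
```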
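
The hyperparameters quoted in the Experiment Setup row can be summarized as a small configuration object. Field names below are illustrative rather than the actual config keys of the gingsi/coot-videotext repository, and the loss shown is a generic bidirectional max-margin ranking loss that such margins typically parameterize, not necessarily COOT's exact objective.

```python
from dataclasses import dataclass

import torch


@dataclass
class CootSetup:
    """Hyperparameters quoted in the paper; field names are illustrative."""
    margin: float = 0.2               # alpha = alpha_g = beta = gamma = mu = 0.2
    batch_size: int = 64              # video/paragraph pairs per mini-batch
    cycle_clips_per_video: int = 1    # sampling for the cycle-consistency loss
    cycle_sentences_per_paragraph: int = 1
    loss_weight_lambda: float = 1.0   # paper: optimal value depends on architecture/dataset
    hidden_size: int = 384
    activation: str = "gelu"
    t_transformer_self_attention_layers: int = 1
    cot_self_attention_layers: int = 1
    cot_cross_attention_layers: int = 1


def max_margin_loss(video_emb, text_emb, margin=0.2):
    """Generic bidirectional max-margin ranking loss with the quoted margin.
    A simplified stand-in, not necessarily the paper's exact formulation."""
    sim = video_emb @ text_emb.t()                    # (B, B); cosine similarity if inputs are L2-normalized
    pos = sim.diag().unsqueeze(1)                     # matching pairs sit on the diagonal
    cost_v2t = (margin + sim - pos).clamp(min=0)      # video -> text: penalize negative texts
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)  # text -> video: penalize negative videos
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_v2t.masked_fill(mask, 0).mean() + cost_t2v.masked_fill(mask, 0).mean()


cfg = CootSetup()
v = torch.nn.functional.normalize(torch.randn(cfg.batch_size, cfg.hidden_size), dim=1)
t = torch.nn.functional.normalize(torch.randn(cfg.batch_size, cfg.hidden_size), dim=1)
print(max_margin_loss(v, t, margin=cfg.margin).item())
```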