Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment
Authors: Wenzhe Wang, Mengdan Zhang, Runnan Chen, Guanyu Cai, Penghao Zhou, Pai Peng, Xiaowei Guo, Jian Wu, Xing Sun
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on the MSR-VTT [Xu et al., 2016], ActivityNet [Caba Heilbron et al., 2015], and LSMDC [Torabi et al., 2016] datasets to evaluate video-text bi-directional retrieval performance. Experimental results show that the proposed model achieves state-of-the-art results on all of the above datasets. Ablation studies are carried out to evaluate the effectiveness of each part of the model. |
| Researcher Affiliation | Collaboration | Wenzhe Wang1, Mengdan Zhang2, Runnan Chen3, Guanyu Cai4, Penghao Zhou2, Pai Peng2, Xiaowei Guo2, Jian Wu1, Xing Sun2 — 1Zhejiang University, China; 2Youtu Lab, Tencent, China; 3The University of Hong Kong, China; 4Tongji University, China |
| Pseudocode | No | The paper describes the model architecture and processes in text and with a diagram (Figure 2), but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository for the described methodology. |
| Open Datasets | Yes | HowTo100M [Miech et al., 2019] consists of more than one million instructional videos from YouTube. MSR-VTT [Xu et al., 2016] contains ten thousand YouTube videos... ActivityNet Captions [Caba Heilbron et al., 2015] is composed of twenty thousand YouTube videos... LSMDC [Torabi et al., 2016] consists of 118,081 video clips... |
| Dataset Splits | No | The paper gives train/test splits: '9,000 samples are utilized for training and the other 1,000 samples are utilized for test' (MSR-VTT), '10,009 videos are utilized for training and 4,917 videos are utilized for test' (ActivityNet, using the val1 split), and '1,000 videos are utilized for test and the other videos are utilized for training' (LSMDC). Although 'val1' is mentioned, it serves as the test split, and no separate validation set for hyperparameter tuning is consistently defined across the datasets. |
| Hardware Specification | Yes | Our experiments are conducted on NVIDIA V100 32G GPUs. |
| Software Dependencies | No | The paper mentions using a 'pre-trained Bert model [Devlin et al., 2018]' and the 'Adam optimizer', but it does not specify version numbers for these or for other software dependencies such as libraries or programming-language versions. |
| Experiment Setup | Yes | The inverse temperature parameter λ of the softmax function in Equation 5 is set to 9, the iteration count K in Equation 7 is set to 3, and the mini-batch size B is set to 32 for MSR-VTT, LSMDC, and ActivityNet and to 64 for HowTo100M. The margin value in L_Tri is 0.2, and the margin value Θ in L_Mar is 0.05. β, which balances the two losses, is 1e-4 for MSR-VTT and LSMDC and 1e-5 for HowTo100M and ActivityNet. The learning rate of all experiments is initialized to 5e-5. The modality-specific transformers and the holistic transformer are composed of 2 layers and 4 attention heads, with a dropout rate of 0.1, a hidden size d_uni of 1024, and an intermediate size of 3072. (An illustrative configuration sketch follows the table.) |
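
To make the reported setup concrete, the sketch below collects the hyperparameters quoted in the Experiment Setup row into one configuration and builds an encoder with the stated shape. It is a minimal, hypothetical PyTorch sketch: the names (`CONFIG`, `build_transformer`) and the use of `nn.TransformerEncoder` are assumptions made for illustration, not the authors' released code, which describes BERT-style modality-specific and holistic transformers.

```python
# Illustrative sketch only. Hyperparameter values come from the paper's
# experiment setup; module/function names here are hypothetical.
import torch
import torch.nn as nn

CONFIG = {
    "softmax_inverse_temperature": 9,   # lambda in Equation 5
    "iteration_time_K": 3,              # K in Equation 7
    "batch_size": {"MSR-VTT": 32, "LSMDC": 32, "ActivityNet": 32, "HowTo100M": 64},
    "triplet_margin": 0.2,              # margin in L_Tri
    "margin_theta": 0.05,               # Theta in L_Mar
    "beta": {"MSR-VTT": 1e-4, "LSMDC": 1e-4, "HowTo100M": 1e-5, "ActivityNet": 1e-5},
    "learning_rate": 5e-5,
    "num_layers": 2,
    "num_heads": 4,
    "dropout": 0.1,
    "hidden_size": 1024,                # d_uni
    "intermediate_size": 3072,
}

def build_transformer(cfg=CONFIG):
    """Build a 2-layer, 4-head encoder matching the reported sizes."""
    layer = nn.TransformerEncoderLayer(
        d_model=cfg["hidden_size"],
        nhead=cfg["num_heads"],
        dim_feedforward=cfg["intermediate_size"],
        dropout=cfg["dropout"],
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=cfg["num_layers"])

if __name__ == "__main__":
    encoder = build_transformer()
    optimizer = torch.optim.Adam(encoder.parameters(), lr=CONFIG["learning_rate"])
    # Dummy multi-modal token sequence: (batch, tokens, d_uni).
    features = torch.randn(CONFIG["batch_size"]["MSR-VTT"], 20, CONFIG["hidden_size"])
    print(encoder(features).shape)      # torch.Size([32, 20, 1024])
```

Running the script should print `torch.Size([32, 20, 1024])`: the encoder keeps the token and hidden dimensions fixed while mixing information across the 20 dummy tokens, which is the shape behaviour one would expect from the transformers described in the row above.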