Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Authors: Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on MSVD and MSR-VTT datasets demonstrate the proposed RMN outperforms the state-of-the-art methods while providing an explicit and explainable generation process."
Researcher Affiliation | Academia | Ganchao Tan (1), Daqing Liu (1), Meng Wang (2), and Zheng-Jun Zha (1). (1) University of Science and Technology of China; (2) Hefei University of Technology. Contact: {tgc1997, liudq}@mail.ustc.edu.cn, eric.mengwang@gmail.com, zhazj@ustc.edu.cn
Pseudocode | No | The paper describes the methodology in text and mathematical equations but does not include a formally labeled pseudocode or algorithm block.
Open Source Code | Yes | "Our code is available at https://github.com/tgc1997/RMN."
Open Datasets | Yes | "MSVD. The MSVD dataset [Chen and Dolan, 2011] consists of 1,970 short video clips selected from YouTube... MSR-VTT. The MSR-VTT [Xu et al., 2016] is a large-scale dataset for the open domain video captioning..."
Dataset Splits | Yes | "To be consistent with previous works, we split the dataset to 3 subsets, 1,200 clips for training, 100 clips for validation, and the remaining 670 clips for testing. [...] Following the existing works, we use the standard splits, namely 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing." (See the split sketch after the table.)
Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments.
Software Dependencies | No | The paper mentions the "Spacy Tagging Tool" but does not provide version numbers for its software dependencies. (See the tagging sketch after the table.)
Experiment Setup | Yes | "Our model is optimized by Adam Optimizer [Kingma and Ba, 2015], the initial learning rate is set to 1e-4. For the MSVD dataset, the hidden size of the LSTM is 512 and the learning rate is divided by 10 every 10 epochs. For the MSR-VTT dataset, the hidden size of the LSTM is 1,300 and the learning rate is divided by 3 every 5 epochs. During testing, we use beam search with size 2 for the final caption generation." (See the training-configuration sketch after the table.)
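
To make the quoted split sizes concrete, here is a minimal Python sketch. It assumes clips are indexed 0..N-1 in the conventional contiguous ordering used by prior work; that ordering is an assumption of this sketch, not something stated in the quoted text.

```python
# Standard splits quoted in the Dataset Splits row, assuming clips are
# indexed 0..N-1 in the conventional contiguous order (an assumption here).
SPLITS = {
    "MSVD":    {"train": range(0, 1200), "val": range(1200, 1300), "test": range(1300, 1970)},
    "MSR-VTT": {"train": range(0, 6513), "val": range(6513, 7010), "test": range(7010, 10000)},
}

for name, split in SPLITS.items():
    sizes = {subset: len(indices) for subset, indices in split.items()}
    print(name, sizes)  # MSVD: 1200/100/670 clips, MSR-VTT: 6513/497/2990 clips
```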
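The Software Dependencies row notes that the paper names the spaCy tagging tool without pinning a version. The sketch below shows the kind of part-of-speech tagging such a tool provides for a caption; the model name `en_core_web_sm` is an assumption, since the paper does not specify which model or version was used.

```python
# Hedged sketch of part-of-speech tagging with spaCy. The paper names the
# tool but no version; "en_core_web_sm" is an assumption, not from the paper.
import spacy

nlp = spacy.load("en_core_web_sm")          # requires: python -m spacy download en_core_web_sm
doc = nlp("a man is playing the guitar")
for token in doc:
    print(token.text, token.pos_)            # e.g. ("man", "NOUN"), ("playing", "VERB")
```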
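The Experiment Setup row quotes all of the reported optimization hyperparameters. The PyTorch sketch below wires those numbers together; the `nn.LSTM` stand-in and the epoch count are placeholders, since the paper's RMN architecture and training length are not reproduced here.

```python
# Minimal PyTorch sketch of the reported optimization settings. Only the
# numbers (Adam, lr 1e-4, hidden sizes 512/1300, step decay, beam size 2)
# come from the paper; the model and epoch count are placeholders.
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

DATASET = "MSVD"  # or "MSR-VTT"

if DATASET == "MSVD":
    hidden_size, step_size, gamma = 512, 10, 1.0 / 10   # lr divided by 10 every 10 epochs
else:
    hidden_size, step_size, gamma = 1300, 5, 1.0 / 3    # lr divided by 3 every 5 epochs

# Stand-in for the RMN captioner; a plain LSTM just to make the sketch runnable.
model = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)

optimizer = Adam(model.parameters(), lr=1e-4)            # initial learning rate 1e-4
scheduler = StepLR(optimizer, step_size=step_size, gamma=gamma)

for epoch in range(30):                                  # epoch count not reported; illustrative
    # ... forward pass, captioning loss, and loss.backward() would go here ...
    optimizer.step()                                     # placeholder update
    scheduler.step()                                     # apply the step decay once per epoch

BEAM_SIZE = 2                                            # beam search width used at test time
```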