Learning to Discretely Compose Reasoning Module Networks for Video Captioning
Authors: Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MSVD and MSR-VTT datasets demonstrate the proposed RMN outperforms the state-of-the-art methods while providing an explicit and explainable generation process. |
| Researcher Affiliation | Academia | Ganchao Tan¹, Daqing Liu¹, Meng Wang², and Zheng-Jun Zha¹ (¹University of Science and Technology of China; ²Hefei University of Technology). {tgc1997, liudq}@mail.ustc.edu.cn, eric.mengwang@gmail.com, zhazj@ustc.edu.cn |
| Pseudocode | No | The paper describes the methodology using text and mathematical equations but does not include a formally labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/tgc1997/RMN. |
| Open Datasets | Yes | MSVD. The MSVD dataset [Chen and Dolan, 2011] consists of 1,970 short video clips selected from YouTube... MSR-VTT. The MSR-VTT [Xu et al., 2016] is a large-scale dataset for the open domain video captioning... |
| Dataset Splits | Yes | To be consistent with previous works, we split the dataset to 3 subsets, 1,200 clips for training, 100 clips for validation, and the remaining 670 clips for testing. [...] Following the existing works, we use the standard splits, namely 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions the 'Spacy Tagging Tool' but does not provide version numbers for it or for any other software dependency. |
| Experiment Setup | Yes | Our model is optimized by Adam Optimizer [Kingma and Ba, 2015], the initial learning rate is set to 1e-4. For the MSVD dataset, the hidden size of the LSTM is 512 and the learning rate is divided by 10 every 10 epochs. For the MSR-VTT dataset, the hidden size of the LSTM is 1,300 and the learning rate is divided by 3 every 5 epochs. During testing, we use beam search with size 2 for the final caption generation. (A configuration sketch follows the table.) |
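The dataset-split and experiment-setup rows above translate directly into training configuration. Below is a minimal sketch of that configuration, assuming a PyTorch implementation (the released RMN code is PyTorch-based); only the numeric values come from the paper, while the helper functions and the `input_size` argument are hypothetical placeholders.

```python
# Minimal sketch of the reported training/testing configuration, assuming a
# PyTorch implementation. Only the numbers come from the paper; the helper
# functions and the `input_size` argument are hypothetical.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Dataset splits reported in the paper (number of video clips).
SPLITS = {
    "MSVD":    {"train": 1200, "val": 100, "test": 670},
    "MSR-VTT": {"train": 6513, "val": 497, "test": 2990},
}

# Per-dataset hyperparameters: LSTM hidden size, LR-decay interval (epochs),
# and decay factor ("divided by 10 every 10 epochs" / "divided by 3 every 5 epochs").
CONFIGS = {
    "MSVD":    {"hidden_size": 512,  "lr_step": 10, "lr_gamma": 1 / 10},
    "MSR-VTT": {"hidden_size": 1300, "lr_step": 5,  "lr_gamma": 1 / 3},
}
INIT_LR = 1e-4   # initial learning rate for Adam
BEAM_SIZE = 2    # beam search width used during testing

def build_decoder(dataset: str, input_size: int = 512) -> torch.nn.LSTM:
    """Hypothetical LSTM decoder sized with the reported hidden size."""
    return torch.nn.LSTM(input_size=input_size,
                         hidden_size=CONFIGS[dataset]["hidden_size"],
                         batch_first=True)

def build_optimizer(dataset: str, model: torch.nn.Module):
    """Adam optimizer plus the step-wise learning-rate decay described above."""
    optimizer = Adam(model.parameters(), lr=INIT_LR)
    scheduler = StepLR(optimizer,
                       step_size=CONFIGS[dataset]["lr_step"],
                       gamma=CONFIGS[dataset]["lr_gamma"])
    return optimizer, scheduler
```

Note that `StepLR` applies the decay once per `scheduler.step()` call, so the sketch assumes one scheduler step per epoch.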