Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

Authors: Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei (pp. 8167-8174)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported when comparing to conventional RNN-based encoder-decoder techniques.
Researcher Affiliation | Collaboration | Jingwen Chen (1), Yingwei Pan (2), Yehao Li (1), Ting Yao (2), Hongyang Chao (1,3), Tao Mei (2). (1) Sun Yat-sen University, Guangzhou, China; (2) JD AI Research, Beijing, China; (3) The Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, Guangzhou, China.
Pseudocode | No | The paper describes the model architecture and operations using mathematical formulas and descriptive text, but it does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the described methodology.
Open Datasets | Yes | Microsoft Research Video Description Corpus (MSVD) (Chen and Dolan 2011) and Microsoft Research Video to Text (MSR-VTT) (Xu et al. 2016).
Dataset Splits | Yes | MSVD: following the standard settings in previous works (Pan et al. 2016; Yao et al. 2015), 1200 videos are used for training, 100 for validation, and 670 for testing. MSR-VTT: following the official split, 6513, 497, and 2990 video clips are used for training, validation, and testing, respectively.
Hardware Specification | Yes | Table 4: Comparison of training time between TDConvED and MP-LSTM on MSVD (Nvidia K40 GPU).
Software Dependencies | No | The paper mentions the Adam optimizer and the Microsoft COCO Evaluation Server API but does not specify version numbers for any programming languages, libraries, or frameworks used for implementation or experimentation.
Experiment Setup | Yes | The kernel size k of the convolutions in the encoder and decoder is set to 3. The convolutional encoder/decoder in TDConvED consists of 2 stacked temporal deformable/shifted convolutional blocks. The dimensions of the intermediate states in the encoder and decoder, i.e., Dr and Df, are both set to 512. The dimension of the hidden layer for measuring the attention distribution, Da, is set to 512. The whole model is trained with the Adam optimizer (Kingma and Ba 2015). The initial learning rate is set to 10^-3 and the mini-batch size to 64. The maximum training duration is set to 30 epochs.
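Since no official code is released, the sketch below is only a hypothetical illustration of the reported configuration (kernel size 3, 2 stacked blocks, 512-dimensional states, Adam with learning rate 10^-3, batch size 64) and dataset split sizes. The class names (TDConvEDConfig, TemporalConvStack), the 2048-dimensional frame features, and the 20-frame clip length are assumptions, and the temporal deformable/shifted sampling of the actual model is replaced here by plain temporal convolutions.

```python
# Hypothetical sketch of the reported setup; not the authors' implementation.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class TDConvEDConfig:
    kernel_size: int = 3        # k, kernel size in encoder and decoder
    num_blocks: int = 2         # stacked temporal conv blocks per side
    d_hidden: int = 512         # Dr / Df, intermediate state dimensions
    d_attention: int = 512      # Da, hidden size of the attention layer
    learning_rate: float = 1e-3
    batch_size: int = 64
    max_epochs: int = 30
    # Dataset splits reported in the paper (train / val / test).
    msvd_split: tuple = (1200, 100, 670)
    msrvtt_split: tuple = (6513, 497, 2990)


class TemporalConvStack(nn.Module):
    """Stand-in for the temporal (deformable) convolutional encoder/decoder;
    deformable offset prediction is omitted in this simplified sketch."""

    def __init__(self, cfg: TDConvEDConfig, d_in: int):
        super().__init__()
        layers, d = [], d_in
        for _ in range(cfg.num_blocks):
            layers += [
                nn.Conv1d(d, cfg.d_hidden, cfg.kernel_size,
                          padding=cfg.kernel_size // 2),
                nn.ReLU(),
            ]
            d = cfg.d_hidden
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, d_in, time) frame-level feature sequence
        return self.net(x)


if __name__ == "__main__":
    cfg = TDConvEDConfig()
    encoder = TemporalConvStack(cfg, d_in=2048)      # assumed CNN feature size
    optimizer = torch.optim.Adam(encoder.parameters(), lr=cfg.learning_rate)
    frames = torch.randn(cfg.batch_size, 2048, 20)   # assumed 20 frames per clip
    print(encoder(frames).shape)                     # -> (64, 512, 20)
```

The full model additionally uses a sentence decoder built on temporally shifted convolutions and a temporal attention mechanism over the encoded video features, which this sketch does not reproduce.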