SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Authors: Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive empirical studies on two benchmark video captioning datasets. The quantitative, qualitative and ablation experimental results comprehensively reveal the effectiveness of our proposed methods.
Researcher Affiliation | Collaboration | Tao Jin¹, Siyu Huang², Ming Chen³, Yingming Li¹, Zhongfei Zhang⁴. ¹College of Information Science & Electronic Engineering, Zhejiang University, China; ²Baidu Research, China; ³Alibaba Group, China; ⁴Department of Computer Science, Binghamton University, USA
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (e.g., a repository link or an explicit statement of code release) for the source code of the described methodology.
Open Datasets | Yes | We evaluate SBAT on two benchmark video captioning datasets, MSVD [Chen and Dolan, 2011] and MSR-VTT [Xu et al., 2016]. Both datasets are provided by Microsoft Research, and a series of state-of-the-art methods have been proposed on these datasets in recent years.
Dataset Splits | No | The paper mentions using a "validation set" but does not provide specific details about the training, validation, or test splits (e.g., percentages, sample counts, or a citation to a specific predefined split).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU/GPU models or memory.
Software Dependencies | No | The paper does not provide ancillary software details with version numbers (e.g., libraries, frameworks) required to replicate the experiment.
Experiment Setup | Yes | The hidden size is set to 512 for all the multi-head attention mechanisms. The numbers of heads and attention blocks are 8 and 4, respectively. The value of α is set to 0.8 in the encoder and 0 in the decoder. In the training phase, we use the Adam [Kingma and Ba, 2014] algorithm to optimize the loss function. The learning rate is initially set to 0.0001. If the CIDEr on the validation set does not improve over 10 epochs, we change the learning rate to 0.00002. The batch size is set to 32. In the testing phase, we use the beam-search method with a beam width of 5 to generate words. We use pre-trained word2vec embeddings to initialize the word vectors. Each word is represented as a 300-dimension vector.
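
Because no code is released, the setup row above is the only configuration record available. Below is a minimal Python sketch of that configuration; the names (SBATConfig, maybe_decay_lr) are hypothetical, only the numeric values come from the paper, and the learning-rate rule is a simplified, stateless reading of the reported CIDEr-plateau schedule.

    from dataclasses import dataclass


    @dataclass
    class SBATConfig:
        # Values below are taken from the paper's setup description;
        # the field names themselves are hypothetical.
        hidden_size: int = 512      # hidden size of every multi-head attention mechanism
        num_heads: int = 8          # number of attention heads
        num_blocks: int = 4         # number of attention blocks
        alpha_encoder: float = 0.8  # alpha value reported for the encoder
        alpha_decoder: float = 0.0  # alpha value reported for the decoder
        word_dim: int = 300         # size of the pre-trained word2vec embeddings
        batch_size: int = 32
        init_lr: float = 1e-4       # initial Adam learning rate
        decayed_lr: float = 2e-5    # learning rate after the CIDEr plateau
        patience: int = 10          # epochs without CIDEr improvement before the decay
        beam_width: int = 5         # beam-search width used at test time


    def maybe_decay_lr(cfg: SBATConfig, cider_history: list) -> float:
        """Learning rate implied by the paper's schedule: switch from 1e-4 to 2e-5
        once validation CIDEr has not improved for `patience` consecutive epochs."""
        if len(cider_history) <= cfg.patience:
            return cfg.init_lr
        best_before = max(cider_history[:-cfg.patience])
        recent = cider_history[-cfg.patience:]
        return cfg.decayed_lr if max(recent) <= best_before else cfg.init_lr


    if __name__ == "__main__":
        cfg = SBATConfig()
        # Example: validation CIDEr stalls after epoch 3, so after 10 flat epochs
        # the schedule drops the learning rate.
        history = [0.40, 0.45, 0.47] + [0.46] * 10
        print(maybe_decay_lr(cfg, history))  # prints 2e-05

Run as a script, the example prints 2e-05, the decayed rate the schedule falls back to after ten epochs without a validation CIDEr improvement.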