Non-Autoregressive Coarse-to-Fine Video Captioning

Authors: Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang. AAAI 2021, pp. 3119-3127.


Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency." and "Experiments. In this section, we evaluate our NACF on two datasets: Microsoft Video Description (MSVD) (Chen and Dolan 2011) and MSR-Video To Text (MSR-VTT) (Xu et al. 2016)."
Researcher Affiliation | Academia | "Bang Yang (1), Yuexian Zou (1,2,*), Fenglin Liu (1), Can Zhang (1); (1) ADSPLAB, School of ECE, Peking University, Shenzhen, China; (2) Peng Cheng Laboratory"
Pseudocode | No | The paper describes its decoding algorithms in text but does not include structured pseudocode or algorithm blocks. (A generic mask-predict-style refinement sketch, explicitly not the authors' exact algorithm, is given after this table.)
Open Source Code | Yes | "The code is available at https://github.com/yangbang18/Non-Autoregressive-Video-Captioning."
Open Datasets | Yes | "Datasets. MSVD contains 1,970 video clips and roughly 80,000 English sentences. We follow the split settings in prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively."
Dataset Splits | Yes | "We follow the split settings in prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively." (These split sizes are collected in the configuration sketch after the table.)
Hardware Specification | Yes | "We measure latency following (Guo et al. 2019; Wang et al. 2019b), and conduct experiments in PyTorch on a single NVIDIA Titan X." (A generic GPU latency-timing pattern is sketched after the table.)
Software Dependencies | No | The paper mentions PyTorch and the NLTK toolkit, but does not provide specific version numbers for these or other key software components used for the experiments.
Experiment Setup | Yes | "Parameter Settings. The maximum sequence length Nmax is set to 20 for MSVD, whereas Nmax = 30 for MSR-VTT. We empirically set K = 8 for each modality. For the decoder, we adopt 1 decoder layer, 512 model dimensions, 2,048 hidden dimensions and 8 attention heads per layer. Both word and position embeddings are implemented by trainable 512-D embedding layers. For regularization, we use 0.5 dropout and 5 × 10^-4 L2 weight decay. We train batches of 64 video-sentence pairs using ADAM (Kingma and Ba 2015) with an initial learning rate of 5 × 10^-3. We stop training when 50 epochs are reached. We use the NLTK toolkit (Bird, Klein, and Loper 2009) for part-of-speech tagging. In the following experiments, our NACF uses the CT-MP algorithm with the number of iterations T set to 5 and beam size B set to 6 unless otherwise specified." (A hedged configuration sketch collecting these values follows.)
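
For quick reference, the dataset splits and hyperparameters quoted in the Dataset Splits and Experiment Setup rows can be collected into one configuration object. The sketch below is purely illustrative: the class and field names (NACFConfig and friends) are assumptions rather than identifiers from the authors' repository, and the learning rate 5e-3 and weight decay 5e-4 are reconstructed from the garbled exponents in the extracted text.

```python
from dataclasses import dataclass

@dataclass
class NACFConfig:
    """Illustrative container for the reported settings (all names are hypothetical)."""
    # Dataset splits: (train, validation, test) videos
    msvd_split: tuple = (1_200, 100, 670)
    msrvtt_split: tuple = (6_513, 497, 2_990)
    # Maximum caption length: 20 for MSVD, 30 for MSR-VTT
    max_len_msvd: int = 20
    max_len_msrvtt: int = 30
    # Visual features: K = 8 features per modality
    feats_per_modality: int = 8
    # Decoder: 1 layer, 512-D model, 2,048-D hidden, 8 attention heads
    num_layers: int = 1
    d_model: int = 512
    d_hidden: int = 2_048
    num_heads: int = 8
    # Regularization and optimization (ADAM)
    dropout: float = 0.5
    weight_decay: float = 5e-4
    batch_size: int = 64
    learning_rate: float = 5e-3
    epochs: int = 50
    # Decoding: CT-MP with T refinement iterations and beam size B
    iterations_T: int = 5
    beam_size_B: int = 6
```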
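The Hardware Specification row states that latency was measured in PyTorch on a single NVIDIA Titan X, following Guo et al. (2019) and Wang et al. (2019b), but the timing code itself is not shown. A common pattern for this kind of measurement looks roughly like the following; `model` and `inputs` are placeholders, and this is not the authors' script.

```python
import time
import torch

def measure_latency(model, inputs, n_warmup=10, n_runs=100, device="cuda"):
    """Average per-batch inference latency in seconds (generic pattern, not the paper's code)."""
    model = model.to(device).eval()
    inputs = inputs.to(device)
    with torch.no_grad():
        for _ in range(n_warmup):       # warm-up passes exclude CUDA initialization cost
            model(inputs)
        torch.cuda.synchronize()        # wait for queued GPU kernels before starting the clock
        start = time.time()
        for _ in range(n_runs):
            model(inputs)
        torch.cuda.synchronize()        # wait again before stopping the clock
    return (time.time() - start) / n_runs
```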
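Since the Pseudocode row notes that the decoding procedure is described only in prose, the sketch below shows generic mask-predict style iterative refinement (Ghazvininejad et al. 2019), the decoding family that the paper's CT-MP variant builds on. It deliberately omits the coarse-to-fine template, length prediction, and beam components, and the `decoder` call signature is a placeholder; it should not be read as the authors' algorithm.

```python
import torch

def mask_predict_decode(decoder, visual_feats, length, T=5, mask_id=0):
    """Generic mask-predict refinement over a fixed-length caption (illustrative only)."""
    tokens = torch.full((1, length), mask_id, dtype=torch.long)  # start fully masked
    scores = torch.zeros(1, length)                              # per-token confidence
    for t in range(T):
        masked = tokens.eq(mask_id)
        logits = decoder(tokens, visual_feats)                   # (1, length, vocab_size)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        tokens = torch.where(masked, preds, tokens)              # fill only the masked slots
        scores = torch.where(masked, probs, scores)
        if t < T - 1:
            # re-mask the n least-confident tokens; n shrinks linearly over iterations
            n = int(length * (T - t - 1) / T)
            if n > 0:
                remask = scores.topk(n, dim=-1, largest=False).indices
                tokens.scatter_(1, remask, mask_id)
    return tokens
```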