Non-Autoregressive Coarse-to-Fine Video Captioning
Authors: Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang (pp. 3119-3127)
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency. From the Experiments section: In this section, we evaluate our NACF on two datasets: Microsoft Video Description (MSVD) (Chen and Dolan 2011) and MSR-Video To Text (MSR-VTT) (Xu et al. 2016). |
| Researcher Affiliation | Academia | Bang Yang (1), Yuexian Zou (1,2)*, Fenglin Liu (1), Can Zhang (1); (1) ADSPLAB, School of ECE, Peking University, Shenzhen, China; (2) Peng Cheng Laboratory |
| Pseudocode | No | The paper describes decoding algorithms in text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/yangbang18/Non-Autoregressive-Video-Captioning. |
| Open Datasets | Yes | Datasets. MSVD contains 1,970 video clips and roughly 80,000 English sentences. We follow the split settings in prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively. |
| Dataset Splits | Yes | We follow the split settings in prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively. |
| Hardware Specification | Yes | We measure latency following (Guo et al. 2019; Wang et al. 2019b), and conduct experiments in PyTorch on a single NVIDIA Titan X. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'NLTK toolkit', but does not provide specific version numbers for these or other key software components used for the experiments. |
| Experiment Setup | Yes | Parameter Settings. The maximum sequence length Nmax is set to 20 for MSVD, whereas Nmax = 30 for MSR-VTT. We empirically set K = 8 for each modality. For the decoder, we adopt 1 decoder layer, 512 model dimensions, 2,048 hidden dimensions and 8 attention heads per layer. Both word and position embeddings are implemented by trainable 512-D embedding layers. For regularization, we use 0.5 dropout and 5×10^-4 L2 weight decay. We train batches of 64 video-sentence pairs using ADAM (Kingma and Ba 2015) with an initial learning rate of 5×10^-3. We stop training our model once 50 epochs are reached. We use NLTK toolkit (Bird, Klein, and Loper 2009) for part-of-speech tagging. In the following experiments, our NACF uses the CT-MP algorithm with the number of iterations T of 5 and beam size B of 6 unless otherwise specified. |
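
For convenience, the settings quoted in the Experiment Setup and Dataset Splits rows above can be gathered into a short configuration sketch. This is a minimal, hypothetical reconstruction in PyTorch based only on the values reported in the paper; the variable names, module choices, and the placeholder vocabulary size are our assumptions and do not mirror the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical summary of the reported hyperparameters
# (names are illustrative, not taken from the authors' repository).
config = {
    "max_seq_len": {"MSVD": 20, "MSR-VTT": 30},   # N_max per dataset
    "splits": {                                   # train / val / test videos
        "MSVD": (1200, 100, 670),
        "MSR-VTT": (6513, 497, 2990),
    },
    "tokens_per_modality": 8,                     # K = 8
    "num_decoder_layers": 1,
    "d_model": 512,
    "d_ff": 2048,
    "num_heads": 8,
    "dropout": 0.5,
    "weight_decay": 5e-4,                         # L2 regularization
    "learning_rate": 5e-3,
    "batch_size": 64,
    "epochs": 50,
    "ctmp_iterations": 5,                         # T in CT-MP decoding
    "beam_size": 6,                               # B in CT-MP decoding
}

vocab_size = 10000  # placeholder; the vocabulary size is not quoted above

# 512-D trainable word and position embeddings, as reported.
word_embedding = nn.Embedding(vocab_size, config["d_model"])
position_embedding = nn.Embedding(config["max_seq_len"]["MSR-VTT"], config["d_model"])

# A single Transformer decoder layer with the reported dimensions.
decoder_layer = nn.TransformerDecoderLayer(
    d_model=config["d_model"],
    nhead=config["num_heads"],
    dim_feedforward=config["d_ff"],
    dropout=config["dropout"],
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=config["num_decoder_layers"])

# ADAM with the reported initial learning rate and L2 weight decay.
optimizer = torch.optim.Adam(
    list(decoder.parameters())
    + list(word_embedding.parameters())
    + list(position_embedding.parameters()),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)
```

The sketch only fixes dimensions and optimizer settings; the coarse-to-fine decoding procedure itself (the CT-MP algorithm with T iterations and beam size B) is described in the paper and released repository rather than reconstructed here.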