Non-Autoregressive Coarse-to-Fine Video Captioning
Authors: Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang (pp. 3119-3127)
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency. From the Experiments section: In this section, we evaluate our NACF on two datasets: Microsoft Video Description (MSVD) (Chen and Dolan 2011) and MSR-Video To Text (MSR-VTT) (Xu et al. 2016). |
| Researcher Affiliation | Academia | Bang Yang (1), Yuexian Zou (1,2)*, Fenglin Liu (1), Can Zhang (1); (1) ADSPLAB, School of ECE, Peking University, Shenzhen, China; (2) Peng Cheng Laboratory |
| Pseudocode | No | The paper describes decoding algorithms in text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/yangbang18/Non-Autoregressive-Video-Captioning. |
| Open Datasets | Yes | Datasets. MSVD contains 1,970 video clips and roughly 80,000 English sentences. We follow the split settings in prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively. |
| Dataset Splits | Yes | We follow the split settings in prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively. |
| Hardware Specification | Yes | We measure latency following (Guo et al. 2019; Wang et al. 2019b), and conduct experiments in PyTorch on a single NVIDIA Titan X. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'NLTK toolkit', but does not provide specific version numbers for these or other key software components used for the experiments. |
| Experiment Setup | Yes | Parameter Settings. The maximum sequence length Nmax is set to 20 for MSVD, whereas Nmax = 30 for MSR-VTT. We empirically set K = 8 for each modality. For the decoder, we adopt 1 decoder layer, 512 model dimensions, 2,048 hidden dimensions and 8 attention heads per layer. Both word and position embeddings are implemented by trainable 512-D embedding layers. For regularization, we use 0.5 dropout and 5×10^-4 L2 weight decay. We train batches of 64 video-sentence pairs using ADAM (Kingma and Ba 2015) with an initial learning rate of 5×10^-3. We stop training our model once 50 epochs are reached. We use NLTK toolkit (Bird, Klein, and Loper 2009) for part-of-speech tagging. In the following experiments, our NACF uses the CT-MP algorithm with the number of iterations T of 5 and beam size B of 6 unless otherwise specified. |
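
For convenience, the settings quoted in the Experiment Setup and Dataset Splits rows above can be gathered into a short configuration sketch. This is a minimal, hypothetical reconstruction in PyTorch based only on the values reported in the paper; the variable names, module choices, and the placeholder vocabulary size are our assumptions and do not mirror the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical summary of the reported hyperparameters
# (names are illustrative, not taken from the authors' repository).
config = {
    "max_seq_len": {"MSVD": 20, "MSR-VTT": 30},   # N_max per dataset
    "splits": {                                   # train / val / test videos
        "MSVD": (1200, 100, 670),
        "MSR-VTT": (6513, 497, 2990),
    },
    "tokens_per_modality": 8,                     # K = 8
    "num_decoder_layers": 1,
    "d_model": 512,
    "d_ff": 2048,
    "num_heads": 8,
    "dropout": 0.5,
    "weight_decay": 5e-4,                         # L2 regularization
    "learning_rate": 5e-3,
    "batch_size": 64,
    "epochs": 50,
    "ctmp_iterations": 5,                         # T in CT-MP decoding
    "beam_size": 6,                               # B in CT-MP decoding
}

vocab_size = 10000  # placeholder; the vocabulary size is not quoted above

# 512-D trainable word and position embeddings, as reported.
word_embedding = nn.Embedding(vocab_size, config["d_model"])
position_embedding = nn.Embedding(config["max_seq_len"]["MSR-VTT"], config["d_model"])

# A single Transformer decoder layer with the reported dimensions.
decoder_layer = nn.TransformerDecoderLayer(
    d_model=config["d_model"],
    nhead=config["num_heads"],
    dim_feedforward=config["d_ff"],
    dropout=config["dropout"],
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=config["num_decoder_layers"])

# ADAM with the reported initial learning rate and L2 weight decay.
optimizer = torch.optim.Adam(
    list(decoder.parameters())
    + list(word_embedding.parameters())
    + list(position_embedding.parameters()),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)
```

The sketch only fixes dimensions and optimizer settings; the coarse-to-fine decoding procedure itself (the CT-MP algorithm with T iterations and beam size B) is described in the paper and released repository rather than reconstructed here.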