Adaptive Feature Abstraction for Translating Video to Text

Authors: Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
Researcher Affiliation | Collaboration | Department of Electrical and Computer Engineering, Duke University ({yunchen.pu, zhe.gan, lcarin}@duke.edu); Machine Learning Group, NEC Laboratories America (renqiang@nec-labs.com)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We present results on three benchmark datasets: Microsoft Research Video Description Corpus (YouTube2Text) (Chen and Dolan 2011), Montreal Video Annotation Dataset (M-VAD) (Torabi, Pal, and Courville 2015), and Microsoft Research Video to Text (MSR-VTT) (Xu et al. 2016).
Dataset Splits | Yes | For fair comparison, we used the same splits as provided in Venugopalan et al. (2015b), with 1200 videos for training, 100 videos for validation, and 670 videos for testing. (An illustrative split sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software components and models such as C3D, LSTM, and RNN, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We do not perform any dataset-specific tuning and regularization other than dropout (Srivastava et al. 2014) and early stopping on validation sets. (An illustrative training-setup sketch appears after the table.)
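
For concreteness, the 1200/100/670 split reported in the Dataset Splits row can be written out in a few lines of Python. This is only a sketch: the video identifiers (`vid1` ... `vid1970`) and the contiguous index ranges are assumptions based on the commonly used MSVD ordering, and the authoritative ID-to-split mapping is the one released by Venugopalan et al. (2015b), not this code.

```python
# Illustrative sketch of the 1200/100/670 YouTube2Text split described above.
# Assumes the 1970 clips are numbered vid1..vid1970 in the conventional MSVD
# order; the authoritative split files are those of Venugopalan et al. (2015b).

def youtube2text_splits():
    video_ids = [f"vid{i}" for i in range(1, 1971)]  # 1970 clips in total
    train = video_ids[:1200]       # 1200 videos for training
    val   = video_ids[1200:1300]   # 100 videos for validation
    test  = video_ids[1300:]       # 670 videos for testing
    return train, val, test

if __name__ == "__main__":
    train, val, test = youtube2text_splits()
    assert (len(train), len(val), len(test)) == (1200, 100, 670)
    print(len(train), len(val), len(test))
```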
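
The Experiment Setup row names only two forms of regularization: dropout (Srivastava et al. 2014) and early stopping on the validation set. The PyTorch sketch below illustrates how those two choices typically fit together in a training loop. The decoder architecture, dimensions, patience value, and the `train_step`/`validate` callables are hypothetical placeholders, not values or code taken from the paper.

```python
# Minimal sketch of the regularization choices named in the Experiment Setup row:
# dropout plus early stopping on the validation set. All hyperparameters here
# are placeholders for illustration only.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, p_drop=0.5):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)                          # dropout regularization
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats):
        h, _ = self.lstm(self.dropout(feats))
        return self.out(self.dropout(h))

def train_with_early_stopping(model, train_step, validate, max_epochs=100, patience=10):
    """Stop when validation loss has not improved for `patience` consecutive epochs."""
    best_val, bad_epochs, best_state = float("inf"), 0, None
    for epoch in range(max_epochs):
        train_step(model)            # one pass over the training split
        val_loss = validate(model)   # loss on the held-out validation split
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # early stopping criterion
                break
    if best_state is not None:
        model.load_state_dict(best_state)   # restore the best validation checkpoint
    return model
```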