VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Authors: Kashu Yamazaki, Khoa Vo, Quang Sang Truong, Bhiksha Raj, Ngan Le

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity.
Researcher Affiliation | Academia | Kashu Yamazaki*1, Khoa Vo*1, Quang Sang Truong1, Bhiksha Raj2,3, Ngan Le1. 1 AICV Lab, University of Arkansas, Fayetteville, Arkansas, USA; 2 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; 3 Mohammed bin Zayed University of AI.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
Open Datasets | Yes | We benchmark VLTinT on two popular datasets, ActivityNet Captions (Krishna et al. 2017) and YouCookII (Zhou, Xu, and Corso 2018).
Dataset Splits | Yes | ActivityNet Captions consists of 10,009 training videos and 4,917 validation videos. We follow the previous work (Lei et al. 2020) to split the original validation set into two subsets: ae-val with 2,460 videos for validation and ae-test with 2,457 videos for testing. YouCookII contains 1,333 training and 457 validation videos. (These splits are summarized in the sketch after the table.)
Hardware Specification | Yes | We ran the experiment on a single NVIDIA RTX 3090 (24GB) GPU.
Software Dependencies | No | The paper mentions models and optimizers used (e.g., C3D, Faster-RCNN, CLIP, Adam optimizer) but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks, or libraries.
Experiment Setup | Yes | We set the hidden size to 768, the number of transformer layers to 3, and the number of attention heads to 12. The Adam optimizer was used to train VLTinT with an initial learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, and learning rate warmup over the first 5 epochs. During training, we use label smoothing with a value of 0.1 and λ = 0.1. (A training-setup sketch follows below the table.)
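
For quick reference, the dataset splits reported in the Dataset Splits row can be collected in a small configuration dictionary. This is only an illustrative summary: the split names ae-val and ae-test follow the convention of Lei et al. (2020), the counts are numbers of videos, and the dictionary name is ours, not part of the released code.

```python
# Illustrative summary of the benchmark splits quoted in the Dataset Splits row.
# Values are video counts; "ae-val" / "ae-test" follow the naming of Lei et al. (2020).
DATASET_SPLITS = {
    "ActivityNet Captions": {"train": 10_009, "ae-val": 2_460, "ae-test": 2_457},
    "YouCookII": {"train": 1_333, "val": 457},
}
```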
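
The training configuration quoted in the Experiment Setup row can be expressed roughly as follows. This is a minimal sketch assuming a PyTorch implementation: the encoder stand-in, the per-epoch warmup schedule, and the loss wiring are placeholders rather than the authors' actual VLTinT code, which is available at the repository linked above.

```python
import torch
import torch.nn as nn

# Stand-in encoder with the reported dimensions: hidden size 768,
# 3 transformer layers, 12 attention heads. The real VLTinT architecture
# lives in the authors' repository (https://github.com/UARK-AICV/VLTinT).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=3,
)

# Adam with the reported hyperparameters: lr 1e-4, betas (0.9, 0.999),
# and L2 weight decay of 0.01.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)

# Linear learning-rate warmup over the first 5 epochs
# (assuming the scheduler is stepped once per epoch).
warmup_epochs = 5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)

# Cross-entropy over caption tokens with label smoothing of 0.1
# (the label_smoothing argument requires PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# λ = 0.1 from the paper weights an auxiliary loss term; its exact
# definition is given in the paper, so it is only named here.
lambda_aux = 0.1
```

Note that passing weight_decay to torch.optim.Adam applies the classic L2 penalty, which matches the paper's wording of "L2 weight decay" (as opposed to the decoupled variant used by AdamW).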