VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Authors: Kashu Yamazaki, Khoa Vo, Quang Sang Truong, Bhiksha Raj, Ngan Le

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity.
Researcher Affiliation | Academia | Kashu Yamazaki*1, Khoa Vo*1, Quang Sang Truong1, Bhiksha Raj2,3, Ngan Le1. 1 AICV Lab, University of Arkansas, Fayetteville, Arkansas, USA; 2 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; 3 Mohammed bin Zayed University of AI.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
Open Datasets | Yes | We benchmark VLTinT on two popular datasets, ActivityNet Captions (Krishna et al. 2017) and YouCookII (Zhou, Xu, and Corso 2018).
Dataset Splits | Yes | ActivityNet Captions consists of 10,009 training videos and 4,917 validation videos. We follow the previous work (Lei et al. 2020) to split the original validation set into two subsets: ae-val with 2,460 videos for validation and ae-test with 2,457 videos for testing. YouCookII contains 1,333 training and 457 validation videos. (These splits are summarized in the sketch after the table.)
Hardware Specification | Yes | We ran the experiment on a single NVIDIA RTX 3090 (24GB) GPU.
Software Dependencies | No | The paper mentions models and optimizers used (e.g., C3D, Faster-RCNN, CLIP, Adam optimizer) but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks, or libraries.
Experiment Setup | Yes | We set the hidden size to 768, the number of transformer layers to 3, and the number of attention heads to 12. The Adam optimizer was used to train VLTinT with an initial learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, and learning rate warmup over the first 5 epochs. During training, we use label smoothing with a value of 0.1 and λ = 0.1. (A training-setup sketch follows below the table.)
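
For quick reference, the dataset splits reported in the Dataset Splits row can be collected in a small configuration dictionary. This is only an illustrative summary: the split names ae-val and ae-test follow the convention of Lei et al. (2020), the counts are numbers of videos, and the dictionary name is ours, not part of the released code.

```python
# Illustrative summary of the benchmark splits quoted in the Dataset Splits row.
# Values are video counts; "ae-val" / "ae-test" follow the naming of Lei et al. (2020).
DATASET_SPLITS = {
    "ActivityNet Captions": {"train": 10_009, "ae-val": 2_460, "ae-test": 2_457},
    "YouCookII": {"train": 1_333, "val": 457},
}
```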
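
The training configuration quoted in the Experiment Setup row can be expressed roughly as follows. This is a minimal sketch assuming a PyTorch implementation: the encoder stand-in, the per-epoch warmup schedule, and the loss wiring are placeholders rather than the authors' actual VLTinT code, which is available at the repository linked above.

```python
import torch
import torch.nn as nn

# Stand-in encoder with the reported dimensions: hidden size 768,
# 3 transformer layers, 12 attention heads. The real VLTinT architecture
# lives in the authors' repository (https://github.com/UARK-AICV/VLTinT).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=3,
)

# Adam with the reported hyperparameters: lr 1e-4, betas (0.9, 0.999),
# and L2 weight decay of 0.01.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)

# Linear learning-rate warmup over the first 5 epochs
# (assuming the scheduler is stepped once per epoch).
warmup_epochs = 5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)

# Cross-entropy over caption tokens with label smoothing of 0.1
# (the label_smoothing argument requires PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# λ = 0.1 from the paper weights an auxiliary loss term; its exact
# definition is given in the paper, so it is only named here.
lambda_aux = 0.1
```

Note that passing weight_decay to torch.optim.Adam applies the classic L2 penalty, which matches the paper's wording of "L2 weight decay" (as opposed to the decoupled variant used by AdamW).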