VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Authors: Kashu Yamazaki, Khoa Vo, Quang Sang Truong, Bhiksha Raj, Ngan Le
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. |
| Researcher Affiliation | Academia | Kashu Yamazaki*1, Khoa Vo*1, Quang Sang Truong1, Bhiksha Raj2,3, Ngan Le1 — 1 AICV Lab, University of Arkansas, Fayetteville, Arkansas, USA; 2 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; 3 Mohammed bin Zayed University of AI |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT. |
| Open Datasets | Yes | We benchmark VLTinT on two popular datasets, ActivityNet Captions (Krishna et al. 2017) and YouCookII (Zhou, Xu, and Corso 2018). |
| Dataset Splits | Yes | ActivityNet Captions consists of 10,009 training videos and 4,917 validation videos. We follow the previous work (Lei et al. 2020) to split the original validation set into two subsets: ae-val with 2,460 videos for validation and ae-test with 2,457 videos for testing. YouCookII contains 1,333 training and 457 validation videos. |
| Hardware Specification | Yes | We ran the experiment on a single NVIDIA RTX 3090 (24GB) GPU. |
| Software Dependencies | No | The paper mentions models and optimizers used (e.g., C3D, Faster-RCNN, CLIP, Adam optimizer) but does not provide specific version numbers for software dependencies like programming languages, deep learning frameworks, or libraries. |
| Experiment Setup | Yes | We set the hidden size to 768, the number of transformer layers to 3, and the number of attention heads to 12. The Adam optimizer was used to train VLTinT with an initial learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, and learning rate warmup over the first 5 epochs. During training, we use label smoothing with a value of 0.1 and λ = 0.1. |
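
For readers who want to mirror the reported experiment setup, the sketch below shows one way the quoted hyperparameters could be wired up in PyTorch. It is a minimal illustration built only from the values in the table above; the `VLTinT` class, its vocabulary size, and the linear warmup shape are assumptions and not the authors' released implementation (see the linked repository for the actual code).

```python
# Hedged sketch of the reported training configuration:
# hidden size 768, 3 transformer layers, 12 heads, Adam with lr 1e-4,
# betas (0.9, 0.999), L2 weight decay 0.01, 5-epoch warmup, label smoothing 0.1.
import torch
import torch.nn as nn

HIDDEN_SIZE = 768
NUM_LAYERS = 3
NUM_HEADS = 12
WARMUP_EPOCHS = 5
BASE_LR = 1e-4
LABEL_SMOOTHING = 0.1
LAMBDA = 0.1  # weighting factor lambda quoted in the setup


class VLTinT(nn.Module):
    """Hypothetical stand-in for the paper's model, sized as reported."""

    def __init__(self, vocab_size: int = 10000):  # vocab size is an assumption
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=HIDDEN_SIZE, nhead=NUM_HEADS, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        self.head = nn.Linear(HIDDEN_SIZE, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, HIDDEN_SIZE) -> token logits
        return self.head(self.encoder(x))


model = VLTinT()

# Adam with the reported hyperparameters; weight_decay here is Adam's plain
# L2 penalty, matching the quoted "L2 weight decay of 0.01".
optimizer = torch.optim.Adam(
    model.parameters(), lr=BASE_LR, betas=(0.9, 0.999), weight_decay=0.01
)

# Warmup over the first 5 epochs; a linear ramp is assumed, since the exact
# warmup schedule is not specified in the quoted text.
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / WARMUP_EPOCHS)
)

# Cross-entropy with the reported label smoothing of 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)
```

In this sketch the warmup scheduler would be stepped once per epoch (`warmup.step()`) after the training loop for that epoch, after which the learning rate stays at its base value unless a decay schedule (not described in the quoted setup) is added.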