STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Authors: Weihong Zhong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng Feng, Bing Qin

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g. 3.7 Rouge-L improvements on MSRVTT video captioning benchmark, 2.9% accuracy improvements on MSVD video question answering benchmark, compared to previous approaches).
Researcher Affiliation | Collaboration | Weihong Zhong (1), Mao Zheng (2), Duyu Tang (3), Xuan Luo (2), Heng Gong (1), Xiaocheng Feng (1,4)*, Bing Qin (1,4); 1: Harbin Institute of Technology, 2: Tencent MLPD, 3: Independent Researcher, 4: Peng Cheng Laboratory
Pseudocode | No | The paper describes its methods in prose and equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code will be available at https://github.com/whongzhong/STOA-VLP.
Open Datasets | Yes | Instead of pre-training on the commonly used dataset HowTo100M (Miech et al. 2019) with 136M noisy pretraining data which only contains instructional videos, we utilize the WebVid-2M (Bain et al. 2021) to prevent an enormous computation cost.
Dataset Splits | No | The paper refers to common benchmark datasets used for evaluation but does not explicitly state the train/validation/test splits or their sizes within the text.
Hardware Specification | Yes | All pre-trainings are conducted on 128 Nvidia Tesla V100 GPUs with a batch size of 1024.
Software Dependencies | No | The paper mentions models and tools but does not provide specific version numbers for software dependencies (e.g., libraries, frameworks) required for reproducibility.
Experiment Setup | Yes | We uniformly sample 12 frames from each video. We resize and center crop them into 224x224 to split into patches with size 16x16, getting H=W=14. The maximum length of the text is set to 32. We select K=10 objects per frame. We set the number of object trajectory tokens to 20, and the number of action trajectory tokens is set to 4.
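For reference, the experiment setup quoted above can be collected into a small configuration sketch. This is only an illustrative summary under the stated hyperparameters; the class and field names (e.g. STOAVLPConfig, patches_per_side) are hypothetical and are not taken from the authors' released code.

```python
from dataclasses import dataclass


@dataclass
class STOAVLPConfig:
    """Hypothetical summary of the pre-training setup quoted above.

    Field names are illustrative only, not from the official repository.
    """
    num_frames: int = 12            # frames uniformly sampled per video
    frame_size: int = 224           # frames resized and center-cropped to 224x224
    patch_size: int = 16            # ViT-style patches of size 16x16
    max_text_len: int = 32          # maximum text token length
    objects_per_frame: int = 10     # K = 10 objects selected per frame
    object_trajectory_tokens: int = 20
    action_trajectory_tokens: int = 4

    @property
    def patches_per_side(self) -> int:
        # H = W = 224 / 16 = 14 patches along each spatial dimension
        return self.frame_size // self.patch_size


cfg = STOAVLPConfig()
assert cfg.patches_per_side == 14  # consistent with H = W = 14 in the paper
```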