STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Authors: Weihong Zhong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng Feng, Bing Qin

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g. 3.7 Rouge-L improvements on MSRVTT video captioning benchmark, 2.9% accuracy improvements on MSVD video question answering benchmark, compared to previous approaches).
Researcher Affiliation | Collaboration | Weihong Zhong (1), Mao Zheng (2), Duyu Tang (3), Xuan Luo (2), Heng Gong (1), Xiaocheng Feng (1,4)*, Bing Qin (1,4); 1: Harbin Institute of Technology, 2: Tencent MLPD, 3: Independent Researcher, 4: Peng Cheng Laboratory
Pseudocode | No | The paper describes its methods in prose and equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code will be available at https://github.com/whongzhong/STOA-VLP.
Open Datasets | Yes | Instead of pre-training on the commonly used dataset HowTo100M (Miech et al. 2019) with 136M noisy pretraining data which only contains instructional videos, we utilize the WebVid-2M (Bain et al. 2021) to prevent an enormous computation cost.
Dataset Splits | No | The paper refers to common benchmark datasets used for evaluation but does not explicitly state the train/validation/test splits or their sizes within the text.
Hardware Specification | Yes | All pre-trainings are conducted on 128 Nvidia Tesla V100 GPUs with a batch size of 1024.
Software Dependencies | No | The paper mentions models and tools but does not provide specific version numbers for software dependencies (e.g., libraries, frameworks) required for reproducibility.
Experiment Setup | Yes | We uniformly sample 12 frames from each video. We resize and center crop them into 224x224 to split into patches with size 16x16, getting H=W=14. The maximum length of the text is set to 32. We select K=10 objects per frame. We set the number of object trajectory tokens to 20, and the number of action trajectory tokens is set to 4.
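For reference, the experiment setup quoted above can be collected into a small configuration sketch. This is only an illustrative summary under the stated hyperparameters; the class and field names (e.g. STOAVLPConfig, patches_per_side) are hypothetical and are not taken from the authors' released code.

```python
from dataclasses import dataclass


@dataclass
class STOAVLPConfig:
    """Hypothetical summary of the pre-training setup quoted above.

    Field names are illustrative only, not from the official repository.
    """
    num_frames: int = 12            # frames uniformly sampled per video
    frame_size: int = 224           # frames resized and center-cropped to 224x224
    patch_size: int = 16            # ViT-style patches of size 16x16
    max_text_len: int = 32          # maximum text token length
    objects_per_frame: int = 10     # K = 10 objects selected per frame
    object_trajectory_tokens: int = 20
    action_trajectory_tokens: int = 4

    @property
    def patches_per_side(self) -> int:
        # H = W = 224 / 16 = 14 patches along each spatial dimension
        return self.frame_size // self.patch_size


cfg = STOAVLPConfig()
assert cfg.patches_per_side == 14  # consistent with H = W = 14 in the paper
```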