Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Authors: Weihong Zhong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng Feng, Bing Qin
AAAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g. 3.7 Rouge-L improvements on MSRVTT video captioning benchmark, 2.9% accuracy improvements on MSVD video question answering benchmark, compared to previous approaches). |
| Researcher Affiliation | Collaboration | Weihong Zhong¹, Mao Zheng², Duyu Tang³, Xuan Luo², Heng Gong¹, Xiaocheng Feng¹,⁴*, Bing Qin¹,⁴; ¹ Harbin Institute of Technology, ² Tencent MLPD, ³ Independent Researcher, ⁴ Peng Cheng Laboratory |
| Pseudocode | No | The paper describes its methods in prose and equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be available at https://github.com/whongzhong/STOA-VLP. |
| Open Datasets | Yes | Instead of pre-training on the commonly used dataset HowTo100M (Miech et al. 2019) with 136M noisy pretraining data which only contains instructional videos, we utilize the WebVid-2M (Bain et al. 2021) to prevent an enormous computation cost. |
| Dataset Splits | No | The paper refers to common benchmark datasets used for evaluation but does not explicitly state the train/validation/test splits or their sizes within the text. |
| Hardware Specification | Yes | All pre-trainings are conducted on 128 Nvidia Tesla V100 GPUs with a batch size of 1024. |
| Software Dependencies | No | The paper mentions models and tools but does not provide specific version numbers for software dependencies (e.g., libraries, frameworks) required for reproducibility. |
| Experiment Setup | Yes | We uniformly sample 12 frames from each video. We resize and center crop them into 224x224 to split into patches with size 16x16, getting H=W=14. The maximum length of the text is set to 32. We select K=10 objects per frame. We set the number of object trajectory tokens to 20, and the number of action trajectory tokens is set to 4. |
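
The figures quoted in the Hardware Specification and Experiment Setup rows can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code: the constant and function names are hypothetical, and the per-GPU batch size assumes the global batch of 1024 is split evenly across the 128 GPUs, which the excerpt does not state. How the model actually consumes these patches and objects is not described in the excerpt.

```python
# Values quoted in the table above; names are illustrative only.
NUM_FRAMES = 12          # frames uniformly sampled per video
CROP_SIZE = 224          # side length after resize and center crop
PATCH_SIZE = 16          # side length of each square patch
OBJECTS_PER_FRAME = 10   # K in the excerpt
NUM_GPUS = 128
GLOBAL_BATCH_SIZE = 1024


def patch_grid(crop_size: int, patch_size: int) -> int:
    """Return the number of patches along one side of a square crop."""
    if crop_size % patch_size != 0:
        raise ValueError("crop size must be a multiple of patch size")
    return crop_size // patch_size


grid = patch_grid(CROP_SIZE, PATCH_SIZE)              # H = W in the excerpt
patches_per_frame = grid * grid
patches_per_video = NUM_FRAMES * patches_per_frame
objects_per_video = NUM_FRAMES * OBJECTS_PER_FRAME

# Assumption: the global batch is divided evenly across GPUs.
per_gpu_batch = GLOBAL_BATCH_SIZE // NUM_GPUS

print(f"grid: {grid}x{grid}, patches per frame: {patches_per_frame}")
print(f"patches per video: {patches_per_video}")
print(f"objects per video: {objects_per_video}")
print(f"per-GPU batch (assumed even split): {per_gpu_batch}")
```

The computed grid side matches the H=W=14 stated in the Experiment Setup row; the other derived quantities are consequences of the quoted numbers rather than values reported by the paper.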