STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Authors: Weihong Zhong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng Feng, Bing Qin
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g., a 3.7 ROUGE-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark, compared to previous approaches). |
| Researcher Affiliation | Collaboration | Weihong Zhong (1), Mao Zheng (2), Duyu Tang (3), Xuan Luo (2), Heng Gong (1), Xiaocheng Feng (1,4)*, Bing Qin (1,4); affiliations: (1) Harbin Institute of Technology, (2) Tencent MLPD, (3) Independent Researcher, (4) Peng Cheng Laboratory |
| Pseudocode | No | The paper describes its methods in prose and equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be available at https://github.com/whongzhong/STOA-VLP. |
| Open Datasets | Yes | Instead of pre-training on the commonly used HowTo100M dataset (Miech et al. 2019), which contains 136M noisy pre-training samples drawn only from instructional videos, we utilize WebVid-2M (Bain et al. 2021) to avoid an enormous computation cost. |
| Dataset Splits | No | The paper refers to common benchmark datasets used for evaluation but does not explicitly state the train/validation/test splits or their sizes within the text. |
| Hardware Specification | Yes | All pre-trainings are conducted on 128 Nvidia Tesla V100 GPUs with a batch size of 1024. |
| Software Dependencies | No | The paper mentions models and tools but does not provide specific version numbers for software dependencies (e.g., libraries, frameworks) required for reproducibility. |
| Experiment Setup | Yes | We uniformly sample 12 frames from each video, resize and center crop them to 224×224, and split each frame into 16×16 patches, giving H = W = 14. The maximum text length is set to 32. We select K = 10 objects per frame, set the number of object trajectory tokens to 20, and set the number of action trajectory tokens to 4. (A configuration sketch follows the table.) |
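
To make the reported setup concrete, below is a minimal configuration sketch, not the authors' released code. All class and function names are hypothetical; only the numeric values (12 frames, 224×224 crops, 16×16 patches, text length 32, K = 10 objects, 20 object and 4 action trajectory tokens, global batch size 1024 on 128 V100 GPUs) come from the paper.

```python
# Minimal sketch (NOT the authors' released code) of the pre-training
# configuration described above. Names are hypothetical; only the
# numeric values are taken from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    num_frames: int = 12           # frames uniformly sampled per video
    frame_size: int = 224          # resize + center crop to 224x224
    patch_size: int = 16           # split each frame into 16x16 patches
    max_text_len: int = 32         # maximum text token length
    objects_per_frame: int = 10    # K = 10 objects selected per frame
    object_traj_tokens: int = 20   # object trajectory tokens
    action_traj_tokens: int = 4    # action trajectory tokens
    global_batch_size: int = 1024  # total batch across all devices
    num_gpus: int = 128            # Nvidia Tesla V100 GPUs

    @property
    def patches_per_side(self) -> int:
        # 224 // 16 = 14, matching H = W = 14 in the paper
        return self.frame_size // self.patch_size

    @property
    def per_gpu_batch_size(self) -> int:
        # 1024 // 128 = 8 samples per GPU per optimizer step
        return self.global_batch_size // self.num_gpus


def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """One plausible reading of 'uniformly sample 12 frames': take the
    midpoint of each of num_samples equal segments of the video."""
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


cfg = PretrainConfig()
assert cfg.patches_per_side == 14   # H = W = 14
assert cfg.per_gpu_batch_size == 8  # implied by 1024 / 128
print(uniform_frame_indices(total_frames=300, num_samples=cfg.num_frames))
```

The patch grid (224 / 16 = 14) and per-GPU batch size (1024 / 128 = 8) are sanity checks derived from the stated values; any distributed-training details beyond the device count and global batch size are not specified in the paper.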