COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Authors: Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that COSA consistently improves performance across a broad range of semantic vision-language downstream tasks, including paragraph-to-video retrieval, text-to-video/image retrieval, video/image captioning and video QA. Notably, COSA achieves state-of-the-art results on various competitive benchmarks.
Researcher Affiliation | Collaboration | Sihan Chen (1,2), Xingjian He (2), Handong Li (1,2), Xiaojie Jin (3), Jiashi Feng (3), Jing Liu (1,2); 1 School of Artificial Intelligence, University of Chinese Academy of Sciences; 2 Institute of Automation, Chinese Academy of Sciences; 3 Bytedance Inc.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are released at https://github.com/TXH-mercury/COSA.
Open Datasets | Yes | Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations... open-sourced web-crawled video-text corpora, such as WebVid10M (Bain et al., 2021), are still two orders of magnitude smaller than its image-text counterpart (LAION-2B (Schuhmann et al., 2021))... We evaluate text-to-video retrieval on four benchmarks, including MSRVTT (Xu et al., 2016), DiDeMo (Anne Hendricks et al., 2017), LSMDC (Rohrbach et al., 2017), and ActivityNet (Krishna et al., 2017a).
Dataset Splits | Yes | Train/val/test splits for the different benchmarks of these datasets are presented in Table 11.
Hardware Specification | Yes | We train COSA models using the PyTorch framework and 64 Tesla V100 cards.
Software Dependencies | No | We train COSA models using the PyTorch framework and 64 Tesla V100 cards. ... All models utilize BERT-Base as the text encoder. No specific version numbers for PyTorch or other libraries are provided.
Experiment Setup | Yes | The initial learning rate is set to 1e-4, and a 10% warm-up strategy with a linear decay schedule is employed. The batch size is set to 2048. For pretraining and video-text task finetuning, we use 224 image resolution, while for image-text task finetuning, we use a higher resolution, which is presented together with other details in the Appendix. (Appendix Tables 10 and 12 also provide detailed settings such as learning rate, batch size, epochs, resolution, and training/test frame counts.) A minimal sketch of the reported schedule follows the table.
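
The Experiment Setup row only lists scalar hyperparameters. The snippet below is a minimal PyTorch sketch of that optimization schedule under stated assumptions: it reproduces the reported initial learning rate of 1e-4, 10% linear warm-up, and linear decay, but the total step count, the weight decay value, and the placeholder model are assumptions of ours and are not taken from the paper.

```python
# Hedged sketch of the reported COSA pretraining schedule:
# LR 1e-4, 10% linear warm-up, linear decay, global batch 2048, 224x224 inputs.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 100_000                    # assumed; not reported in this summary
WARMUP_STEPS = int(0.10 * TOTAL_STEPS)   # 10% warm-up, as stated in the paper
BASE_LR = 1e-4                           # initial learning rate from the paper
GLOBAL_BATCH_SIZE = 2048                 # sharded across 64 GPUs in the reported setup
IMAGE_RESOLUTION = 224                   # pretraining / video-text finetuning resolution

# Placeholder module; COSA's actual architecture (vision encoder + BERT-Base
# text encoder) is not reproduced here.
model = torch.nn.Linear(768, 768)
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.01)  # weight decay assumed

def lr_lambda(step: int) -> float:
    """Linear warm-up for the first 10% of steps, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
```

In an actual pretraining loop, scheduler.step() would be called once per optimizer step, and the global batch of 2048 would be split across the 64 V100 cards noted in the Hardware Specification row.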