CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

Authors: Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a series of comparative experiments and analyses, we find that the data scale and domain gap between language sources have large impacts. By these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
Researcher Affiliation | Collaboration | Hongwei Xue1, Yuchong Sun2, Bei Liu3, Jianlong Fu3, Ruihua Song2, Houqiang Li1, Jiebo Luo4. 1University of Science and Technology of China, Hefei, China; 2Renmin University of China, Beijing, China; 3Microsoft Research, Beijing, China; 4University of Rochester, Rochester, NY
Pseudocode | No | The paper describes methods in text and uses equations, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
Open Datasets | Yes | Two open-domain video-text datasets are used: WebVid-2.5M (Bain et al., 2021) with 2.5 million pairs and HD-VILA-100M (Xue et al., 2022) with 100M pairs. ... MSR-VTT (Xu et al., 2016) contains 10K YouTube videos with 200K descriptions. We follow previous works (Yu et al., 2018; Liu et al., 2019) to train models on 9K videos... (b) DiDeMo (Anne Hendricks et al., 2017)... (c) LSMDC (Rohrbach et al., 2016)... (d) ActivityNet Captions (Krishna et al., 2017a).
Dataset Splits | Yes | Evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and validation sets. ... We follow the paragraph-to-video retrieval setting (Zhang et al., 2018; Liu et al., 2019) to train models on 10K videos and report results on the val1 set with 4.9K videos.
Hardware Specification | Yes | We train our model with 32 NVIDIA Tesla V100 GPUs in a batch size of 1024.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'CLIP's tokenizer', but it does not specify version numbers for these or other software libraries.
Experiment Setup | Yes | We use AdamW optimizer (Loshchilov & Hutter, 2019), and empirically set an initial learning rate as 5e-6 and a fixed weight decay as 5e-2. For the learning rate schedule, we adopt a cosine decay with a warmup strategy. We train our model with 32 NVIDIA Tesla V100 GPUs in a batch size of 1024. ... 1) Batch size: we fine-tune our model with a batch size of 128 for all downstream tasks for a fair comparison. 2) Learning rate and weight decay: we empirically set them to 1e-6 and 0.2, respectively. 3) Number of epochs: due to the various scales of downstream datasets, we set epoch numbers to 5, 20, 10, and 20 for MSR-VTT, DiDeMo, LSMDC, and ActivityNet, respectively. 4) Frame number: for a fair comparison, we set frame number to 12 except for ActivityNet Captions (set to 32).
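
To make the reported setup concrete, here is a minimal, hedged sketch of the pre-training and fine-tuning optimizer configuration in PyTorch. It is not the authors' released training code: the learning rates, weight decays, batch sizes, and cosine-with-warmup schedule come from the quotes above, while the placeholder model and the warmup/total step counts are illustrative assumptions.

```python
# Hedged sketch of the CLIP-ViP optimization setup described above.
# Hyperparameters (lr, weight decay, schedule type) follow the quoted text;
# the model, warmup_steps, and total_steps are placeholders, not from the paper.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay ('cosine decay with a warmup strategy')."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)


model = torch.nn.Linear(512, 512)  # stand-in for the CLIP-ViP model (assumption)

# Post-pretraining: lr 5e-6, weight decay 5e-2, global batch size 1024 on 32 V100s.
pretrain_optimizer = AdamW(model.parameters(), lr=5e-6, weight_decay=5e-2)
pretrain_scheduler = cosine_with_warmup(
    pretrain_optimizer, warmup_steps=1_000, total_steps=100_000  # step counts assumed
)

# Fine-tuning: lr 1e-6, weight decay 0.2, batch size 128; epochs 5/20/10/20 for
# MSR-VTT/DiDeMo/LSMDC/ActivityNet; 12 frames (32 for ActivityNet Captions).
finetune_optimizer = AdamW(model.parameters(), lr=1e-6, weight_decay=0.2)
```

In a training loop, `pretrain_scheduler.step()` would be called once per optimizer step so that the warmup and cosine decay track the global step count.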