CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Authors: Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a series of comparative experiments and analyses, we find that the data scale and domain gap between language sources have large impacts. By these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. |
| Researcher Affiliation | Collaboration | Hongwei Xue1, Yuchong Sun2, Bei Liu3, Jianlong Fu3, Ruihua Song2, Houqiang Li1, Jiebo Luo4; 1University of Science and Technology of China, Hefei, China; 2Renmin University of China, Beijing, China; 3Microsoft Research, Beijing, China; 4University of Rochester, Rochester, NY |
| Pseudocode | No | The paper describes methods in text and uses equations, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP. |
| Open Datasets | Yes | Two open-domain video-text datasets are used: WebVid-2.5M (Bain et al., 2021) with 2.5 million pairs and HD-VILA-100M (Xue et al., 2022) with 100M pairs. ... MSR-VTT (Xu et al., 2016) contains 10K YouTube videos with 200K descriptions. We follow previous works (Yu et al., 2018; Liu et al., 2019) to train models on 9K videos... (b) DiDeMo (Anne Hendricks et al., 2017)... (c) LSMDC (Rohrbach et al., 2016)... (d) ActivityNet Captions (Krishna et al., 2017a). |
| Dataset Splits | Yes | Evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and validation sets. ... We follow the paragraph-to-video retrieval setting (Zhang et al., 2018; Liu et al., 2019) to train models on 10K videos and report results on the val1 set with 4.9K videos. |
| Hardware Specification | Yes | We train our model with 32 NVIDIA Tesla V100 GPUs in a batch size of 1024. |
| Software Dependencies | No | The paper mentions using 'AdamW optimizer' and 'CLIP’s tokenizer', but it does not specify version numbers for these or other software libraries. |
| Experiment Setup | Yes | We use AdamW optimizer (Loshchilov & Hutter, 2019), and empirically set an initial learning rate as 5e-6 and a fixed weight decay as 5e-2. For the learning rate schedule, we adopt a cosine decay with a warmup strategy. We train our model with 32 NVIDIA Tesla V100 GPUs in a batch size of 1024. ... 1) Batch size: we fine-tune our model with a batch size of 128 for all downstream tasks for a fair comparison. 2) Learning rate and weight decay: we empirically set them to 1e-6 and 0.2, respectively. 3) Number of epochs: due to the various scales of downstream datasets, we set epoch numbers to 5, 20, 10, and 20 for MSR-VTT, DiDeMo, LSMDC, and ActivityNet, respectively. 4) Frame number: for a fair comparison, we set frame number to 12 except for ActivityNet Captions (set to 32). |
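
The quoted experiment setup maps directly onto an optimizer configuration. Below is a minimal PyTorch sketch of the pre-training settings (AdamW, initial learning rate 5e-6, weight decay 5e-2, cosine decay with warmup); the `model` object, the warmup step count, and the use of a `LambdaLR` schedule are assumptions not given in the excerpt, so treat this as an illustration rather than the authors' implementation.

```python
# Sketch of the optimizer and schedule quoted in the Experiment Setup row.
# Assumptions (not stated in the excerpt): the model object, the number of
# warmup steps, and implementing the cosine schedule via LambdaLR.
import math
import torch


def build_optimizer_and_scheduler(model, total_steps, warmup_steps=1000,
                                  lr=5e-6, weight_decay=5e-2):
    """AdamW with linear warmup followed by cosine decay.

    Pre-training values from the paper excerpt: lr=5e-6, weight_decay=5e-2.
    For downstream fine-tuning the excerpt instead quotes lr=1e-6 and
    weight_decay=0.2; warmup_steps here is a placeholder assumption.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

The remaining quoted settings (batch size 1024 for pre-training, 128 for fine-tuning; per-dataset epoch counts; 12 frames per video, 32 for ActivityNet Captions) would live in the data-loading and training-loop configuration rather than in this optimizer sketch.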