InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We learn a new video-language model, ViCLIP, which is trained on InternVid using ViT-L. It incorporates both contrastive learning and mask modeling, allowing for efficient learning of transferable video-language representations. This model achieves state-of-the-art zero-shot action recognition on Kinetics, scoring 75.7, 73.5, and 66.4 on K400, K600, and K700 with the average of top-1 and top-5 accuracies, respectively. It obtains competitive performance on video retrieval, setting a new baseline for video-text understanding. (Hedged sketches of this training objective and the evaluation metric follow the table.) |
| Researcher Affiliation | Academia | ¹OpenGVLab, Shanghai AI Laboratory; ²Nanjing University; ³Monash University; ⁴The University of Hong Kong; ⁵Nanyang Technological University; ⁶Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
| Pseudocode | Yes | Here is a pseudocode example of this process: |
| Open Source Code | Yes | https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid |
| Open Datasets | Yes | This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. |
| Dataset Splits | Yes | We learn ViCLIP on subsets of InternVid and evaluate its performance on video-related benchmarks using full-finetuned and zero-shot settings. |
| Hardware Specification | Yes | ViCLIP is learned with 64 NVIDIA A100 GPUs for 3 days with 50M video-text pairs. |
| Software Dependencies | No | We introduce DeepSpeed and FlashAttention (Dao et al., 2022) for training and inference. |
| Experiment Setup | Yes | Table 9: Video-text retrieval fine-tuning settings. (and the table itself, which details optimizer, learning rates, batch size, epochs, input frame count, max text length, drop path, and augmentation) |
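
The Research Type row quotes ViCLIP's training recipe: contrastive video-text alignment combined with mask modeling for efficiency. Below is a minimal, hedged sketch of what one such training step could look like; `video_encoder`, `text_encoder`, the `num_patches` attribute, and the hyperparameter values are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def viclip_style_step(video_encoder, text_encoder, videos, texts,
                      mask_ratio=0.9, temperature=0.07):
    """One hedged training step: symmetric contrastive alignment of video
    and text embeddings, with most video patch tokens randomly masked so
    the video encoder only processes the kept tokens (the efficiency idea
    behind combining contrastive learning with mask modeling).

    video_encoder / text_encoder are placeholders assumed to accept these
    arguments and return fixed-size embeddings.
    """
    # Sample a random subset of patch indices to keep; the encoder is
    # assumed to drop the masked tokens before attention (MAE-style).
    b, n = videos.shape[0], video_encoder.num_patches  # assumed attribute
    keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=videos.device)
    keep_idx = noise.argsort(dim=1)[:, :keep]

    v = F.normalize(video_encoder(videos, keep_idx), dim=-1)  # (B, D)
    t = F.normalize(text_encoder(texts), dim=-1)              # (B, D)

    # Symmetric InfoNCE over the in-batch video-text similarity matrix.
    logits = v @ t.T / temperature
    labels = torch.arange(b, device=logits.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
    return loss
```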
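
The Kinetics numbers above (e.g. 75.7 on K400) are reported as the average of top-1 and top-5 accuracy. A small helper illustrating that metric, under the assumption that `logits` holds video-to-class similarities, as in zero-shot classification against text embeddings of the class names:

```python
import torch

@torch.no_grad()
def avg_top1_top5(logits, labels):
    """Mean of top-1 and top-5 accuracy from similarity logits of shape
    (num_videos, num_classes), with integer ground-truth labels."""
    top5 = logits.topk(5, dim=-1).indices             # (N, 5), sorted
    correct5 = (top5 == labels[:, None]).any(dim=-1)  # label in top-5?
    correct1 = top5[:, 0] == labels                   # argmax == label?
    return 0.5 * (correct1.float().mean() + correct5.float().mean())
```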