InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We learn a new video-language model, ViCLIP, which is trained on InternVid using ViT-L. It incorporates both contrastive learning and mask modeling, allowing for efficient learning of transferable video-language representations. This model achieves state-of-the-art zero-shot action recognition on Kinetics, scoring 75.7, 73.5, and 66.4 on K400, K600, and K700 with the average of top-1 and top-5 accuracies, respectively. It obtains competitive performance on video retrieval, setting a new baseline for video-text understanding. (Hedged sketches of this training objective and the evaluation metric follow the table.) |
| Researcher Affiliation | Academia | ¹OpenGVLab, Shanghai AI Laboratory; ²Nanjing University; ³Monash University; ⁴The University of Hong Kong; ⁵Nanyang Technological University; ⁶Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
| Pseudocode | Yes | Here is a pseudocode example of this process: |
| Open Source Code | Yes | https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid |
| Open Datasets | Yes | This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. |
| Dataset Splits | Yes | We learn ViCLIP on subsets of InternVid and evaluate its performance on video-related benchmarks using full-finetuned and zero-shot settings. |
| Hardware Specification | Yes | ViCLIP is learned with 64 NVIDIA A100 GPUs for 3 days with 50M video-text pairs. |
| Software Dependencies | No | We introduce DeepSpeed and FlashAttention (Dao et al., 2022) for training and inference. |
| Experiment Setup | Yes | Table 9: Video-text retrieval fine-tuning settings. (and the table itself, which details optimizer, learning rates, batch size, epochs, input frame count, max text length, drop path, and augmentation) |
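
The Research Type row quotes ViCLIP's training recipe: contrastive video-text alignment combined with mask modeling for efficiency. Below is a minimal, hedged sketch of what one such training step could look like; `video_encoder`, `text_encoder`, the `num_patches` attribute, and the hyperparameter values are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def viclip_style_step(video_encoder, text_encoder, videos, texts,
                      mask_ratio=0.9, temperature=0.07):
    """One hedged training step: symmetric contrastive alignment of video
    and text embeddings, with most video patch tokens randomly masked so
    the video encoder only processes the kept tokens (the efficiency idea
    behind combining contrastive learning with mask modeling).

    video_encoder / text_encoder are placeholders assumed to accept these
    arguments and return fixed-size embeddings.
    """
    # Sample a random subset of patch indices to keep; the encoder is
    # assumed to drop the masked tokens before attention (MAE-style).
    b, n = videos.shape[0], video_encoder.num_patches  # assumed attribute
    keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=videos.device)
    keep_idx = noise.argsort(dim=1)[:, :keep]

    v = F.normalize(video_encoder(videos, keep_idx), dim=-1)  # (B, D)
    t = F.normalize(text_encoder(texts), dim=-1)              # (B, D)

    # Symmetric InfoNCE over the in-batch video-text similarity matrix.
    logits = v @ t.T / temperature
    labels = torch.arange(b, device=logits.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
    return loss
```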
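
The Kinetics numbers above (e.g. 75.7 on K400) are reported as the average of top-1 and top-5 accuracy. A small helper illustrating that metric, under the assumption that `logits` holds video-to-class similarities, as in zero-shot classification against text embeddings of the class names:

```python
import torch

@torch.no_grad()
def avg_top1_top5(logits, labels):
    """Mean of top-1 and top-5 accuracy from similarity logits of shape
    (num_videos, num_classes), with integer ground-truth labels."""
    top5 = logits.topk(5, dim=-1).indices             # (N, 5), sorted
    correct5 = (top5 == labels[:, None]).any(dim=-1)  # label in top-5?
    correct1 = top5[:, 0] == labels                   # argmax == label?
    return 0.5 * (correct1.float().mean() + correct5.float().mean())
```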