Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Authors: Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations than previous methods, and significantly outperform previous state-of-the-art methods across all metrics.
Researcher Affiliation | Collaboration | Peng Jin (1,3), Jinfa Huang (1,3), Fenglin Liu (4), Xian Wu (5), Shen Ge (5), Guoli Song (2), David A. Clifton (4,6), Jie Chen (1,2,3). Affiliations: 1 School of Electronic and Computer Engineering, Peking University, China; 2 Peng Cheng Laboratory, Shenzhen, China; 3 AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China; 4 Department of Engineering Science, University of Oxford, UK; 5 Tencent JARVIS Lab, China; 6 Oxford-Suzhou Centre for Advanced Research, Suzhou, China
Pseudocode | Yes | Algorithm 1: The proposed Expectation-Maximization Contrastive Learning, with T iterations of routing. Typically, T = 9. (See the routing sketch after the table.)
Open Source Code | Yes | Code: https://github.com/jpthu17/EMCL
Open Datasets | Yes | Datasets. We conduct the experiments on three popular text-video retrieval datasets, i.e., MSR-VTT [70], ActivityNet Captions [27], and LSMDC [55], and follow common practice [42, 13, 62] to pre-process the datasets for fair comparison. In detail, MSR-VTT [70] contains 10,000 videos, each with 20 text descriptions; we follow the 1k-A split [41] with 9,000 videos for training and 1,000 for testing. ActivityNet Captions [27] contains 20,000 videos with multiple sentence descriptions; we report results on the 'val1' split (10,009 training, 4,917 testing) as in [21]. LSMDC [55] contains 118,081 video clips from 202 movies; we follow the split of [21] with 1,000 videos for testing.
Dataset Splits | No | The paper mentions training and testing splits, but does not explicitly describe a separate validation split with specific details for all datasets, though ActivityNet Captions mentions a 'val1' split.
Hardware Specification | No | The paper does not explicitly state the hardware specifications (e.g., GPU model, CPU, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) and the Adam optimizer but does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | Implementation Details. We utilize the CLIP (ViT-B/32) [53] equipped with a Temporal Transformer [42] as the pre-trained Bi-Encoder (Base Model). Following previous works [42], the frame length and caption length are 12 and 32 for MSR-VTT and LSMDC. For ActivityNet, a long-video retrieval dataset, we set the frame length to 64 and the caption length to 64. We follow training schedules from previous works [42, 13, 62]. Concretely, we use the Adam optimizer [26] with a linear warmup. The initial learning rate is 1e-7 for the text encoder and video encoder and 1e-4 for other modules. We set the temperature τ = 0.01, σ = 1, the momentum α = 0.9, the number of iterations T = 9, and the parameter K = 32. The network is optimized with a batch size of 128 for 5 epochs. (See the optimizer sketch below.)
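
The pseudocode row above refers to Algorithm 1, which alternates an E-step (soft assignment of the batch features to a small set of K bases) and an M-step (re-estimation of those bases) for T rounds, then reconstructs the features in the compact subspace spanned by the bases. The PyTorch sketch below is a minimal illustration of that routing loop, not the released implementation: the names (`emcl_em_routing`, `feats`, `bases`), the softmax E-step with a 1/σ scale, and the normalization details are assumptions made for readability.

```python
import torch
import torch.nn.functional as F


def emcl_em_routing(feats, bases, T=9, sigma=1.0):
    """Sketch of one EMCL routing pass.

    feats: (N, d) stacked video/text embeddings from the current batch.
    bases: (K, d) subspace bases (K = 32 in the paper).
    Returns the reconstructed features and the updated bases.
    """
    for _ in range(T):
        # E-step: soft-assign each feature to each basis.
        logits = feats @ bases.t() / sigma              # (N, K)
        resp = F.softmax(logits, dim=-1)                # responsibilities
        # M-step: re-estimate bases as responsibility-weighted means.
        resp_norm = resp / (resp.sum(dim=0, keepdim=True) + 1e-6)
        bases = F.normalize(resp_norm.t() @ feats, dim=-1)   # (K, d)
    # Reconstruct each feature inside the compact K-dimensional subspace.
    recon = resp @ bases                                # (N, d)
    return recon, bases


# Toy usage: 128 stacked embeddings of dimension 512, 32 bases.
feats = F.normalize(torch.randn(128, 512), dim=-1)
bases = F.normalize(torch.randn(32, 512), dim=-1)
recon, bases = emcl_em_routing(feats, bases, T=9)
```

The paper's momentum α = 0.9 is read here as carrying the bases across batches, e.g. `bases = 0.9 * old_bases + 0.1 * new_bases`; where exactly that update happens is an implementation detail of the released code.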
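
The implementation details in the last row can also be summarized as a small optimizer configuration. The sketch below sets up Adam with the two stated learning rates and a linear warmup; the placeholder modules, the parameter grouping, and the 10% warmup length are assumptions, since the paper only states "Adam with a linear warmup".

```python
import torch
import torch.nn as nn

# Placeholders standing in for the pre-trained CLIP bi-encoder and the
# newly added modules (temporal transformer, EMCL block, ...).
clip_encoder  = nn.Linear(512, 512)
extra_modules = nn.Linear(512, 512)

optimizer = torch.optim.Adam(
    [
        {"params": clip_encoder.parameters(),  "lr": 1e-7},  # text/video encoder (CLIP)
        {"params": extra_modules.parameters(), "lr": 1e-4},  # other modules
    ]
)

# ~9,000 training videos (MSR-VTT 1k-A), batch size 128, 5 epochs.
num_training_steps = (9000 // 128) * 5
warmup_steps = int(0.1 * num_training_steps)   # warmup fraction is an assumption

# Linear warmup from 0 to the base learning rates.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / max(1, warmup_steps)),
)
```

The remaining hyperparameters reported in the row (temperature τ = 0.01 for the contrastive loss, σ = 1, momentum α = 0.9, T = 9 routing iterations, K = 32 bases) plug into the loss and routing modules rather than the optimizer.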