BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

Authors: Yizhao Gao, Nanyi Fei, Haoyu Lu, Zhiwu Lu, Hao Jiang, Yijie Li, Zhao Cao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive results show that our BMU-MoCo remarkably outperforms recent competitors w.r.t. video-text retrieval performance and forgetting rate, even without using any extra data or dynamic networks.
Researcher Affiliation | Collaboration | Yizhao Gao, Nanyi Fei, Haoyu Lu, Zhiwu Lu (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Big Data Management and Analysis Methods); Hao Jiang, Yijie Li, Zhao Cao (Huawei Poisson Lab, Hangzhou, Zhejiang, China)
Pseudocode | Yes | The full (pseudocode) algorithm of our BMU-MoCo is presented in the supplementary material. (A generic momentum-update sketch is given after this table.)
Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | Under our CVLM setting, models are supposed to be sequentially trained on five widely-used video-text datasets: VATEX [54], ActivityNet [25], MSR-VTT [55], DiDeMo [20], and MSVD [10].
Dataset Splits | Yes | VATEX [54] is a large-scale open-domain dataset, which has 25,991 videos with 250K text descriptions for training, 3,000 videos for validation and 6,000 videos for testing.
Hardware Specification | Yes | The total training time on five tasks is around 20 hours with 8 Tesla V100 GPUs for each model.
Software Dependencies | No | The paper mentions using specific pre-trained models like 'ViT-Base [13]/BERT-Base [12]' as encoders, but it does not provide specific version numbers for software libraries, frameworks, or dependencies (e.g., PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | For the first epoch of each task under our CVLM setting, we set the learning rate to 5e-5 and decay it to 5e-6 afterwards. (3) We select the two momentum coefficients m = 0.99, m̂ = 0.99, and the temperature τ = 0.07. We set the batch size N_B to 48 and the queue size N_Q to 1,440. (4) The total training time on five tasks is around 20 hours with 8 Tesla V100 GPUs for each model. (These hyperparameters are collected into the configuration sketch below.)
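
Because the full BMU-MoCo algorithm is only given as pseudocode in the supplementary material and no code is released, the following is a minimal, hedged sketch of a generic MoCo-style momentum (EMA) encoder update with a queue-based InfoNCE loss in PyTorch, using the reported hyperparameters (m = 0.99, τ = 0.07, N_B = 48, N_Q = 1,440). It is not the authors' implementation, and the bidirectional update over the two momentum coefficients (m and m̂) is not reproduced here; all class and function names are hypothetical.

```python
# Minimal sketch (NOT the paper's BMU-MoCo): a generic MoCo-style momentum
# (EMA) encoder update plus a queue-based InfoNCE loss, using the reported
# hyperparameters m = 0.99, tau = 0.07, N_Q = 1,440. Names are hypothetical.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumEncoderPair(nn.Module):
    def __init__(self, encoder: nn.Module, m: float = 0.99):
        super().__init__()
        self.m = m
        self.encoder_q = encoder                 # online (query) encoder, trained by backprop
        self.encoder_k = copy.deepcopy(encoder)  # momentum (key) encoder, updated by EMA only
        for p in self.encoder_k.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def momentum_update(self):
        # theta_k <- m * theta_k + (1 - m) * theta_q
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

def info_nce(q, k, queue, tau: float = 0.07):
    """InfoNCE loss with one positive key per query and a queue of negatives."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    l_pos = (q * k).sum(dim=-1, keepdim=True)    # (N_B, 1) positive logits
    l_neg = q @ queue.t()                        # (N_B, N_Q) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with the reported batch size N_B = 48 and queue size N_Q = 1,440.
encoder = nn.Linear(512, 256)
pair = MomentumEncoderPair(encoder, m=0.99)
q = pair.encoder_q(torch.randn(48, 512))
with torch.no_grad():
    k = pair.encoder_k(torch.randn(48, 512))
queue = F.normalize(torch.randn(1440, 256), dim=-1)
loss = info_nce(q, k, queue, tau=0.07)
loss.backward()          # gradients flow only into encoder_q
pair.momentum_update()   # EMA step for encoder_k
```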
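
For quick reference, the training setup quoted in the rows above can be collected into one configuration sketch. The dictionary below only restates the reported numbers; the key names and structure are illustrative, not the authors' actual configuration schema.

```python
# Illustrative consolidation of the reported CVLM training setup; the key
# names are hypothetical and only the values come from the paper.
CVLM_TRAINING_SETUP = {
    "datasets": ["VATEX", "ActivityNet", "MSR-VTT", "DiDeMo", "MSVD"],  # trained sequentially
    "encoders": {"video": "ViT-Base", "text": "BERT-Base"},
    "lr_first_epoch": 5e-5,            # learning rate for the first epoch of each task
    "lr_decayed": 5e-6,                # learning rate afterwards
    "momentum_m": 0.99,                # m
    "momentum_m_hat": 0.99,            # m̂
    "temperature_tau": 0.07,
    "batch_size": 48,                  # N_B
    "queue_size": 1440,                # N_Q
    "hardware": "8x Tesla V100 GPUs",
    "total_training_hours": 20,        # approximate, over all five tasks
}
```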