BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling
Authors: Yizhao Gao, Nanyi Fei, Haoyu Lu, Zhiwu Lu, Hao Jiang, Yijie Li, Zhao Cao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive results show that our BMU-MoCo remarkably outperforms recent competitors w.r.t. video-text retrieval performance and forgetting rate, even without using any extra data or dynamic networks. |
| Researcher Affiliation | Collaboration | Yizhao Gao (1,2), Nanyi Fei (1,2), Haoyu Lu (1,2), Zhiwu Lu (1,2), Hao Jiang (3), Yijie Li (3), Zhao Cao (3); (1) Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; (2) Beijing Key Laboratory of Big Data Management and Analysis Methods; (3) Huawei Poisson Lab, Hangzhou, Zhejiang, China |
| Pseudocode | Yes | The full (pseudocode) algorithm of our BMU-MoCo is presented in the supplementary material. (A generic momentum-update sketch is given below the table.) |
| Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] |
| Open Datasets | Yes | Under our CVLM setting, models are supposed to be sequentially trained on five widely-used video-text datasets: VATEX [54], ActivityNet [25], MSR-VTT [55], DiDeMo [20], and MSVD [10]. |
| Dataset Splits | Yes | VATEX [54] is a large-scale open-domain dataset, which has 25,991 videos with 250K text descriptions for training, 3,000 videos for validation and 6,000 videos for testing. |
| Hardware Specification | Yes | The total training time on five tasks is around 20 hours with 8 Tesla V100 GPUs for each model. |
| Software Dependencies | No | The paper mentions using specific pre-trained models like 'ViT-Base [13]/BERT-Base [12]' as encoders, but it does not provide specific version numbers for software libraries, frameworks, or dependencies (e.g., PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | For the first epoch of each task under our CVLM setting, we set the learning rate to 5e-5 and decay it to 5e-6 afterwards. (3) We select the two momentum coefficients m = 0.99, m̂ = 0.99, and the temperature τ = 0.07. We set the batch size N_B to 48 and the queue size N_Q to 1,440. (4) The total training time on five tasks is around 20 hours with 8 Tesla V100 GPUs for each model. (A contrastive-loss sketch using these hyperparameters follows the table.) |
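
The paper defers the full BMU-MoCo algorithm to its supplementary material. As a point of reference only, the following is a minimal sketch of the generic MoCo-style momentum (EMA) encoder update that BMU-MoCo builds on, written in PyTorch under our own assumptions; the function and variable names are illustrative and not taken from the authors' code.

```python
# Minimal sketch (not the authors' code): a MoCo-style momentum (EMA) update
# of a key encoder from a query encoder, the basic building block behind
# momentum-based contrastive video-text training.
import copy
import torch
import torch.nn as nn


@torch.no_grad()
def momentum_update(query_encoder: nn.Module, key_encoder: nn.Module, m: float = 0.99) -> None:
    """EMA update: key_params = m * key_params + (1 - m) * query_params."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)


# Toy usage: two copies of the same small encoder.
encoder_q = nn.Linear(512, 256)          # stand-in for the ViT-Base/BERT-Base encoders
encoder_k = copy.deepcopy(encoder_q)     # momentum (key) encoder, not updated by gradients
for p in encoder_k.parameters():
    p.requires_grad = False

momentum_update(encoder_q, encoder_k, m=0.99)  # m = 0.99 as reported in the paper
```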
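
Similarly, the hyperparameters quoted in the Experiment Setup row can be read as a standard queue-based contrastive (InfoNCE) setup. The sketch below is an assumption-based illustration of how the reported batch size N_B = 48, queue size N_Q = 1,440, and temperature τ = 0.07 would fit together with random embeddings; it is not the paper's loss implementation, and the variable names are hypothetical.

```python
# Illustrative sketch (assumed generic MoCo-style InfoNCE with a negative queue;
# not the paper's actual loss code). Uses the reported batch size N_B = 48,
# queue size N_Q = 1,440, and temperature tau = 0.07 with random embeddings.
import torch
import torch.nn.functional as F

N_B, N_Q, D = 48, 1440, 256                       # batch size, queue size, embedding dim
tau = 0.07                                        # softmax temperature

q = F.normalize(torch.randn(N_B, D), dim=1)       # query embeddings (e.g., video)
k_pos = F.normalize(torch.randn(N_B, D), dim=1)   # positive keys (e.g., paired text)
queue = F.normalize(torch.randn(N_Q, D), dim=1)   # negative keys from the momentum queue

l_pos = (q * k_pos).sum(dim=1, keepdim=True)      # (N_B, 1) positive logits
l_neg = q @ queue.t()                             # (N_B, N_Q) negative logits
logits = torch.cat([l_pos, l_neg], dim=1) / tau
labels = torch.zeros(N_B, dtype=torch.long)       # positives sit at index 0
loss = F.cross_entropy(logits, labels)            # contrastive (InfoNCE) loss
```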