Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Authors: Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently.
Researcher Affiliation | Academia | 1 ADSPLAB, School of ECE, Peking University, Shenzhen, China; 2 Pengcheng Laboratory, Shenzhen, China. {yangbang, chengxx, ywl, asifraza151, zouyx}@pku.edu.cn, chd-dy@foxmail.com
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/yangbang18/CLFM.
Open Datasets | Yes | We build a CLL benchmark based on MSCOCO (Chen et al. 2015) and XM3600 (Thapliyal et al. 2022) to evaluate the effectiveness of our proposals. ... We train models on MSCOCO36 based on the Karpathy split (Karpathy and Fei-Fei 2015). ... Table 1: MSCOCO36 (Train/Val/Test Images 113,287/5,000/5,000).
Dataset Splits | Yes | Table 1: MSCOCO36 (Train/Val/Test Images 113,287/5,000/5,000). ... We train models on MSCOCO36 based on the Karpathy split (Karpathy and Fei-Fei 2015). ... The model achieving the highest summation of Recall@{1, 5, 10} on the current-task validation set is selected for training on the next task.
Hardware Specification | Yes | We conduct experiments in PyTorch on a single NVIDIA V100 card and every run of an experiment takes less than 20 hours.
Software Dependencies | No | The paper mentions 'PyTorch' but does not provide a specific version number. No other specific software dependencies with version numbers are listed.
Experiment Setup | Yes | We set the initial temperature of Lcm to 0.07. We search the hyperparameters γ1 and γ2 in Equation (4) from values {1, 0.1, 0.01} and set γ1 = 0.01 and γ2 = 1 based on the AR metric on the validation set. ... For each task, we set the vocab size to 10K. We use batches of 128 samples and AdamW (Loshchilov and Hutter 2019) with L2 weight decay of 0.05 to train models for 3 epochs. We set the learning rate fixed to 5e-5 after 10% warm-up iterations.
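The Recall@{1, 5, 10} retrieval metrics quoted above (used both for the headline XM3600 result and for per-task checkpoint selection) can be computed from a query-to-image similarity matrix. Below is a minimal sketch, not the paper's implementation; the function name `recall_at_k` and the toy similarity values are illustrative, and the usual paired-benchmark convention is assumed (the ground-truth image for query i is image i):

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-image Recall@K from a similarity matrix.

    sim[i][j] is the similarity of text query i to image j; the
    ground-truth image for query i is assumed to be image i
    (the usual convention for paired retrieval benchmarks).
    """
    n = len(sim)
    ranks = []
    for i, row in enumerate(sim):
        order = sorted(range(n), key=lambda j: -row[j])  # best match first
        ranks.append(order.index(i))  # position of the ground-truth image
    return {k: sum(r < k for r in ranks) / n for k in ks}

# Toy example: 3 queries; the correct image is ranked first for two of them.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.1],
       [0.7, 0.6, 0.5]]
scores = recall_at_k(sim)
# The paper selects the checkpoint with the highest summation of
# Recall@{1, 5, 10} on the current-task validation set.
selection_score = sum(scores.values())
```

"Average Recall@1" in the Research Type row is then the mean of this Recall@1 over the 36 XM3600 languages.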
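The Experiment Setup row specifies a compact training recipe: AdamW with weight decay 0.05, batches of 128, 3 epochs, initial contrastive temperature 0.07, and a learning rate fixed at 5e-5 after 10% linear warm-up. A hedged sketch of that schedule follows; the constant names and the helper `lr_at_step` are hypothetical, and linear warm-up is assumed since the paper only says "after 10% warm-up iterations":

```python
BASE_LR = 5e-5           # learning rate, fixed after warm-up
WARMUP_FRACTION = 0.10   # first 10% of iterations warm up (assumed linear)
BATCH_SIZE = 128
EPOCHS = 3
INIT_TEMPERATURE = 0.07  # initial temperature of the contrastive loss Lcm
WEIGHT_DECAY = 0.05      # L2 weight decay passed to AdamW

def lr_at_step(step, total_steps):
    """Learning rate at a given iteration: linear warm-up, then constant."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        return BASE_LR * (step + 1) / warmup_steps
    return BASE_LR
```

With, say, 1,000 total iterations, the first 100 steps ramp linearly from 5e-7 up to 5e-5, and every later step uses 5e-5 exactly, matching the "fixed to 5e-5 after 10% warm-up iterations" description.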