Continual Vision-Language Retrieval via Dynamic Knowledge Rectification

Authors: Zhenyu Cui, Yuxin Peng, Xun Wang, Manyu Zhu, Jiahuan Zhou

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods." "In this section, we conduct extensive experiments to validate the effectiveness of our proposed DKR." "Ablation Study: In this section, we conduct ablation studies under Setting-1 to evaluate the effectiveness of each component of DKR and the effect of the hyperparameter."
Researcher Affiliation | Collaboration | Zhenyu Cui (1), Yuxin Peng (1*), Xun Wang (2), Manyu Zhu (2), Jiahuan Zhou (1); (1) Wangxuan Institute of Computer Technology, Peking University; (2) ByteDance Inc.
Pseudocode | Yes | Algorithm 1: Dynamic Knowledge Rectification (DKR). (An illustrative, heavily hedged sketch of what a rectified-distillation training step could look like is given after the table.)
Open Source Code | No | The paper does not explicitly state that the source code is available, nor does it provide a link to a repository for the DKR implementation.
Open Datasets | Yes | 1) MS-COCO Caption: MS-COCO Caption (MS-COCO) (Lin et al. 2014) is a widely used image caption dataset. 2) Flickr30K: Flickr30K (Young et al. 2014) contains 31,783 images from the Flickr website... 3) IAPR TC-12: IAPR TC-12 (Grubinger et al. 2006) consists of 20,000 images... 4) ECommerce-T2I: ECommerce-T2I (EC) (Yang et al. 2021) is a large-scale e-commerce product retrieval dataset. 5) RSICD: RSICD (Lu et al. 2017) is a remote sensing image retrieval dataset...
Dataset Splits | Yes | MS-COCO Caption (MS-COCO) (Lin et al. 2014) is a widely used image caption dataset. It contains 80K training images and 5K testing images, where each image has five captions. 2) Flickr30K: Flickr30K (Young et al. 2014) contains 31,783 images from the Flickr website, and each image is annotated with 5 sentences. We use 30K images as the training set and the remaining 1K images as the testing set. To further evaluate performance on a specific dataset, we follow the benchmark in (Ni et al. 2023), which randomly and uniformly divides the EC dataset into 5 sub-datasets, and sequentially train on these 5 sub-datasets. (A minimal sketch of this sequential split construction is given after the table.)
Hardware Specification | Yes | The proposed DKR is implemented in PyTorch with NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions PyTorch but does not specify a version number. It also references CLIP without specifying a version that would allow exact replication.
Experiment Setup | Yes | Each task is trained for 35 epochs with a batch size of 280. We use the Adam optimizer with (β1, β2) = (0.9, 0.99) and a weight decay of 0.2 to update the whole CLIP model. The initial learning rate is set to 1e-6 with 20% warm-up iterations, and a cosine-decay learning rate scheduler is used to update the whole framework. The hyperparameter λ is set to 1.0 and 0.1 for Setting-1 and Setting-2, respectively. (A PyTorch sketch of this optimization recipe is given after the table.)
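
Pseudocode note. The report only names Algorithm 1; the paper's pseudocode is not reproduced above. Purely as an illustration of what a "knowledge rectification" style training step could look like in a continual retrieval setting, the sketch below gates a distillation term on whether the frozen old model still ranks the ground-truth pair highest, and otherwise relies on the ground-truth contrastive target. This is not the authors' Algorithm 1; all names (new_model, old_model, lambda_weight) and the dual-encoder API are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def rectified_distillation_step(new_model, old_model, images, texts, lambda_weight=1.0):
    """Illustrative continual-retrieval step: distill from the old model only
    where its knowledge is still correct, otherwise rely on ground truth."""
    # Assumed dual-encoder API: model(images, texts) -> (image_emb, text_emb).
    img_emb, txt_emb = new_model(images, texts)
    logits_new = img_emb @ txt_emb.t()
    # Ground-truth image-text pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(logits_new.size(0), device=logits_new.device)
    loss_task = F.cross_entropy(logits_new, targets)  # new-task retrieval loss

    with torch.no_grad():
        old_img, old_txt = old_model(images, texts)
        logits_old = old_img @ old_txt.t()
        # "Rectification": keep the old model's knowledge only for samples it
        # still retrieves correctly; erroneous rows are excluded from distillation.
        keep = logits_old.argmax(dim=1).eq(targets)

    if keep.any():
        loss_distill = F.kl_div(
            F.log_softmax(logits_new[keep], dim=1),
            F.softmax(logits_old[keep], dim=1),
            reduction="batchmean",
        )
    else:
        loss_distill = logits_new.new_zeros(())

    return loss_task + lambda_weight * loss_distill
```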
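Dataset splits note. The benchmark followed for EC divides the dataset "randomly and uniformly" into 5 sub-datasets that are trained on one after another. A minimal sketch of such a split, assuming a standard list-style PyTorch dataset and using torch.utils.data.Subset (variable names are illustrative, not the benchmark's code):

```python
import torch
from torch.utils.data import Subset

def make_sequential_tasks(dataset, num_tasks=5, seed=0):
    """Randomly and uniformly partition a dataset into `num_tasks` disjoint
    sub-datasets, to be trained on sequentially in the continual setting."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(dataset), generator=g).tolist()
    chunk = len(dataset) // num_tasks
    tasks = []
    for t in range(num_tasks):
        start = t * chunk
        end = (t + 1) * chunk if t < num_tasks - 1 else len(dataset)
        tasks.append(Subset(dataset, perm[start:end]))
    return tasks

# Usage: train on tasks[0], then tasks[1], ..., evaluating forgetting after each stage.
```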
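Experiment setup note. The optimization recipe is fully specified in the row above (35 epochs, batch size 280, Adam with betas (0.9, 0.99) and weight decay 0.2, initial learning rate 1e-6, 20% warm-up followed by cosine decay). The sketch below is one common way to realize that schedule in PyTorch, not the authors' code; the helper name and warm-up implementation are assumptions.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_iters, warmup_frac=0.2):
    """Adam plus linear warm-up and cosine decay, matching the reported setup."""
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-6, betas=(0.9, 0.99), weight_decay=0.2
    )
    warmup_iters = int(warmup_frac * total_iters)

    def lr_lambda(it):
        if it < warmup_iters:
            return (it + 1) / max(1, warmup_iters)               # linear warm-up
        progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# e.g. total_iters = 35 epochs * (len(train_set) // 280) iterations per epoch
```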