Cross-modal Active Complementary Learning with Self-refining Correspondence

Authors: Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences. |
| Researcher Affiliation | Collaboration | 1. College of Computer Science, Sichuan University, Chengdu, China; 2. Centre for Frontier AI Research (CFAR) and Institute of High Performance Computing (IHPC), A*STAR, Singapore; 3. Chengdu Ruibei Yingte Information Technology Co., Ltd, Chengdu, China; 4. Sichuan Zhiqian Technology Co., Ltd, Chengdu, China. |
| Pseudocode | Yes | Algorithm 1: The pseudo-code of CRCL. |
| Open Source Code | Yes | Code is available at https://github.com/QinYang79/CRCL. |
| Open Datasets | Yes | For an extensive evaluation, we use three benchmark datasets (i.e., Flickr30K [34], MS-COCO [35], and CC152K [12]) in our experiments. |
| Dataset Splits | Yes | Following [36], Flickr30K uses 30,000 images for training, 1,000 for validation, and 1,000 for testing. MS-COCO is a large-scale image-text dataset with 123,287 images, each described by 5 captions; following the split of [36, 8], 5,000 images are used for validation, 5,000 for testing, and the rest for training. CC152K contains 150,000 image-text pairs for training, 1,000 for validation, and 1,000 for testing. (A hedged split sketch follows this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using BUTD features and a Bi-GRU textual backbone, but does not provide version numbers for any software libraries, frameworks, or dependencies. |
| Experiment Setup | Yes | The shared hyper-parameters are set the same as in the original works [4, 9], e.g., the batch size is 128, the word embedding size is 300, and the joint embedding dimensionality is 1,024. More specific hyper-parameters and implementation details are given in the supplementary material due to space limitations. (A hedged configuration sketch follows this table.) |
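To make the quoted dataset splits concrete, the following is a minimal sketch that encodes them as data. The `DATASET_SPLITS` name and dictionary layout are illustrative assumptions, not taken from the paper or the CRCL repository; only the counts come from the quote above.

```python
# Illustrative encoding of the dataset splits quoted above. The structure is
# an assumption made for readability; only the counts come from the paper.
DATASET_SPLITS = {
    "Flickr30K": {          # split follows [36]
        "train": 30_000,    # images (each image has 5 captions)
        "val": 1_000,
        "test": 1_000,
    },
    "MS-COCO": {            # 123,287 images in total; split follows [36, 8]
        "val": 5_000,
        "test": 5_000,
        "train": 123_287 - 5_000 - 5_000,  # "the rest" = 113,287 images
    },
    "CC152K": {             # image-text pairs
        "train": 150_000,
        "val": 1_000,
        "test": 1_000,
    },
}

if __name__ == "__main__":
    for name, split in DATASET_SPLITS.items():
        total = sum(split.values())
        print(f"{name}: {split} (total {total:,})")
```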
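Likewise, the shared hyper-parameters from the Experiment Setup row could be gathered into a single configuration object. The class and field names below are hypothetical; only the three values are stated in the paper.

```python
# Hypothetical configuration object for the shared hyper-parameters quoted
# above; the names are assumptions, the values come from the paper.
from dataclasses import dataclass

@dataclass
class CRCLSharedConfig:
    batch_size: int = 128   # same as the original works [4, 9]
    word_dim: int = 300     # word embedding size
    embed_dim: int = 1_024  # joint embedding dimensionality
    # Remaining hyper-parameters are specified in the paper's supplementary material.

if __name__ == "__main__":
    print(CRCLSharedConfig())
```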