Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval
Authors: Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on various image-text retrieval models and datasets, the authors demonstrate that their method consistently improves image-text retrieval performance and achieves new state-of-the-art results. |
| Researcher Affiliation | Academia | 1SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China 2Shen Yuan Honors College, Beihang University, Beijing, China 3School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada 4School of Computer Science and Engineering, Southeast University, Nanjing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa. |
| Open Datasets | Yes | For image-text retrieval, we evaluate our approach on three datasets: Flickr30K (Young et al. 2014), MSCOCO (Lin et al. 2014), and ECCV Caption (Chun et al. 2022). |
| Dataset Splits | No | The paper mentions test sets for datasets like MSCOCO (5K Test Set) and Flickr30K (1K Test Set) and other test sets for image retrieval and STS tasks, but it does not provide specific training/validation split percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Unicom-ViT-B/32' and 'all-mpnet-base-v2' but does not specify exact version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use the above two losses, CSA and USA, together to adjust the original loss of the ITR model, so the overall loss function is expressed as: L_CUSA = L_original + α·L_CSA + β·L_USA, where α and β are loss weights ranging from 0.1 to 1.0. |
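The combined loss quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `cusa_loss` and the default weights are assumptions for the sketch, and the three loss terms stand in for the ITR model's original loss and the paper's CSA/USA alignment losses.

```python
def cusa_loss(l_original: float, l_csa: float, l_usa: float,
              alpha: float = 0.5, beta: float = 0.5) -> float:
    """Combine the original ITR loss with the CSA and USA terms:
    L_CUSA = L_original + alpha * L_CSA + beta * L_USA.

    The paper reports tuning the loss weights alpha and beta in the
    range 0.1 to 1.0; the defaults here are illustrative only.
    """
    if not (0.1 <= alpha <= 1.0 and 0.1 <= beta <= 1.0):
        raise ValueError("alpha and beta are expected in [0.1, 1.0]")
    return l_original + alpha * l_csa + beta * l_usa
```

In practice each term would be a scalar tensor produced by the respective loss module in a training step, with the weighted sum backpropagated as usual.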