Enhancing Cross-modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval

Authors: Tiantian Gong, Junsheng Wang, Liyan Zhang

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on public datasets fully demonstrate the consistent superiority of our method over SOTA text-image person retrieval methods."
Researcher Affiliation | Academia | Tiantian Gong (1), Junsheng Wang (2), Liyan Zhang (1); (1) Nanjing University of Aeronautics and Astronautics, (2) Nanjing University of Science and Technology
Pseudocode | No | The paper describes the proposed method in detail in Section 3, but it does not include a formally labeled "Pseudocode" or "Algorithm" block.
Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository.
Open Datasets | Yes | "CUHK-PEDES [Li et al., 2017b] comprises 40,206 pedestrian images along with 80,412 text descriptions corresponding to 13,003 distinct pedestrian identities. ... ICFG-PEDES [Ding et al., 2021] comprises 54,522 images with 4,102 distinct identities."
Dataset Splits | Yes | "Challenging Data Partitions. We define three distinct settings to represent varying levels of difficulty. For the easy setting, we use 50% of the training set as the complete image-text pair data, 25% as missing image data, and 25% as missing text data, denoted as (50%, 25%, 25%). Similarly, we establish the medium setting, defined as (30%, 35%, 35%), and the hard setting as (10%, 45%, 45%), to elevate the training complexity." (A partitioning sketch follows the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory capacity).
Software Dependencies | No | The paper mentions using "the image encoder and text encoder components of the CLIP [Radford et al., 2021] model", the Adam optimizer [Kingma and Ba, 2014], and NLTK [Loper and Bird, 2002], but it does not specify version numbers for any of these dependencies. (A CLIP loading sketch follows the table.)
Experiment Setup | Yes | "All images are resized to 384 × 128 pixels. For the text modality, the maximum length of text tokens is set to 80. The model is optimized via the Adam optimizer [Kingma and Ba, 2014] with a 0.0001 learning rate. The batch size is set to 64, and the training process spans a total of 60 epochs. The temperature parameter τ (Equations 19 and 24) is set to 0.02." (A configuration sketch follows the table.)
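
The Dataset Splits row specifies only the (complete, missing-image, missing-text) ratios for the three settings; the quoted text does not say whether the split is drawn over identities or over individual image-text pairs. Below is a minimal sketch, assuming a per-identity split with a fixed seed; `partition_identities` and all other names here are hypothetical, not from the paper.

```python
import random

# Hypothetical helper: split training identity IDs into complete-pair,
# missing-image, and missing-text subsets according to the paper's
# (complete, missing-image, missing-text) ratios.
def partition_identities(identity_ids, ratios, seed=0):
    assert abs(sum(ratios) - 1.0) < 1e-6, "ratios must sum to 1"
    ids = list(identity_ids)
    random.Random(seed).shuffle(ids)
    n_complete = int(ratios[0] * len(ids))
    n_missing_img = int(ratios[1] * len(ids))
    complete = ids[:n_complete]
    missing_image = ids[n_complete:n_complete + n_missing_img]
    missing_text = ids[n_complete + n_missing_img:]
    return complete, missing_image, missing_text

# The three settings quoted in the Dataset Splits row.
SETTINGS = {
    "easy":   (0.50, 0.25, 0.25),
    "medium": (0.30, 0.35, 0.35),
    "hard":   (0.10, 0.45, 0.45),
}

complete, missing_image, missing_text = partition_identities(
    range(13003), SETTINGS["hard"]
)
```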
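Since the Software Dependencies row names only the CLIP backbone, with no variant or library version, the following sketch shows one plausible way to obtain the two encoders with OpenAI's `clip` package; the "ViT-B/16" checkpoint is an assumption, not something the paper states.

```python
import torch
import clip  # OpenAI's CLIP package (https://github.com/openai/CLIP)

# Assumption: the ViT-B/16 checkpoint; the paper does not name a CLIP
# variant or a library version.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Note: clip.tokenize defaults to a context length of 77 tokens, while the
# paper sets the maximum text length to 80; matching the paper exactly would
# require extending the text positional embeddings, which is omitted here.
tokens = clip.tokenize(
    ["a woman wearing a red coat and black boots"], truncate=True
).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens)
```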
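The Experiment Setup row is essentially a hyperparameter list; the sketch below transcribes it into a runnable configuration. The stand-in `torch.nn.Linear` model is a placeholder, and anything not quoted in the row (weight decay, learning-rate schedule, data augmentation) is unknown and deliberately left out.

```python
import torch
from torchvision import transforms

# Hyperparameters transcribed from the Experiment Setup row.
CONFIG = {
    "image_size": (384, 128),  # (height, width) in pixels
    "max_text_len": 80,        # maximum number of text tokens
    "lr": 1e-4,                # Adam learning rate
    "batch_size": 64,
    "epochs": 60,
    "temperature": 0.02,       # tau in the paper's Equations 19 and 24
}

# Image preprocessing: resize to 384 x 128.
resize = transforms.Resize(CONFIG["image_size"])

# Placeholder model; the actual retrieval network is built on CLIP encoders.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=CONFIG["lr"])
```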