Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Authors: Zhiwei Zhao, Bin Liu, Yan Lu, Qi Chu, Nenghai Yu

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on three TIReID datasets highlight the effectiveness and superiority of our method over state-of-the-arts. (Sections: Experiments, Experimental Setup)
Researcher Affiliation | Collaboration | 1 School of Cyber Science and Technology, University of Science and Technology of China; 2 CAS Key Laboratory of Electromagnetic Space Information; 3 Shanghai AI Laboratory
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the methodology is openly available.
Open Datasets | Yes | CUHK-PEDES (Li et al. 2017) has 40,206 images and 80,412 textual descriptions associated with 13,003 identities. The training set has 11,003 identities with 34,054 images and 68,108 textual descriptions. The validation and test sets comprise 3,078 and 3,074 images, along with 6,158 and 6,156 textual descriptions, respectively. Both the val/test subsets have 1,000 identities. RSTPReid (Zhu et al. 2021) comprises 20,505 images, showcasing 4,101 unique identities. ICFG-PEDES (Ding et al. 2021) is an identity-centric TIReID dataset, featuring 54,522 images across 4,102 unique identities.
Dataset Splits | Yes | CUHK-PEDES (Li et al. 2017) has 40,206 images and 80,412 textual descriptions associated with 13,003 identities. The training set has 11,003 identities with 34,054 images and 68,108 textual descriptions. The validation and test sets comprise 3,078 and 3,074 images, along with 6,158 and 6,156 textual descriptions, respectively. Both the val/test subsets have 1,000 identities. RSTPReid (Zhu et al. 2021)... The dataset utilizes 3,701, 200, and 200 identities for training, validation, and testing, respectively. ICFG-PEDES (Ding et al. 2021)... The dataset is divided into a training set with 34,674 images from 3,102 identities and a test set containing 19,848 images representing 1,000 identities.
Hardware Specification | Yes | Our approach is implemented using the PyTorch framework on a single NVIDIA RTX 3090 GPU (24 GB).
Software Dependencies | No | The paper mentions the 'PyTorch framework' but does not specify a version number for PyTorch or any other software dependency.
Experiment Setup | Yes | Implementation Details. Our approach is implemented using the PyTorch framework on a single NVIDIA RTX 3090 GPU (24 GB). Similar to the IRRA method (Jiang and Ye 2023), our model comprises a pre-trained image encoder (CLIP-ViT-B/16), a pre-trained text encoder (CLIP text Transformer), and a randomly initialized multimodal interaction encoder. During training, all input images are resized to 384×128, and the patch and stride size are set to 16. We apply random horizontal flipping, RandAugment (Cubuk et al. 2020), and random erasing (Zhong et al. 2020) for image augmentation. The batch size is set to 64. The maximum length of the textual token sequence is 77. Our model is trained using the Adam optimizer (Kingma and Ba 2014) for 60 epochs, with a learning rate initialized at 1×10⁻⁵ and cosine learning rate decay. The learning rate is gradually increased from 1×10⁻⁶ to 1×10⁻⁵ over the 5 warm-up epochs. For the MUM module, both the coupling factor ω and the scale factor s are set to 0.25. The MUM module is applied only during the training phase for feature augmentation; it is not used during the testing phase. The mask rate of the input text tokens during training is set to 30% for CUHK-PEDES and ICFG-PEDES, and 15% for RSTPReid. During testing, the input text is not masked. The hyper-parameters γ and m in the cm-Circle loss are empirically set to 64 and 0.35. The weight λ1 of the cm-Circle loss is set to 2.0 for ICFG-PEDES and RSTPReid, and 0.25 for CUHK-PEDES. The weight λ2 of the cm-GSR loss is set to 0.5.
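
The optimizer and learning-rate schedule quoted in the Experiment Setup row (Adam, 60 epochs, a 5-epoch linear warm-up from 1×10⁻⁶ to 1×10⁻⁵, then cosine decay) can be expressed with standard PyTorch components. The sketch below is a minimal illustration under those assumptions, not the authors' implementation (no code is released per the Open Source Code row); the placeholder model and the lr_factor helper are hypothetical names.

import math
import torch

TOTAL_EPOCHS, WARMUP_EPOCHS = 60, 5
BASE_LR, WARMUP_START_LR = 1e-5, 1e-6

model = torch.nn.Linear(512, 512)  # placeholder for the CLIP-based dual-encoder model
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)

def lr_factor(epoch: int) -> float:
    # Multiplier applied to BASE_LR at the start of each epoch.
    if epoch < WARMUP_EPOCHS:
        # Linear ramp from 1e-6 to 1e-5 over the 5 warm-up epochs.
        t = epoch / WARMUP_EPOCHS
        return (WARMUP_START_LR + t * (BASE_LR - WARMUP_START_LR)) / BASE_LR
    # Cosine decay over the remaining 55 epochs.
    t = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(TOTAL_EPOCHS):
    # One training epoch would go here: batch size 64, images resized to 384x128,
    # random horizontal flipping, RandAugment, and random erasing for augmentation.
    optimizer.step()   # stand-in for the per-batch parameter updates
    scheduler.step()   # advance the warm-up / cosine schedule once per epoch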