Empowering Visible-Infrared Person Re-Identification with Large Foundation Models

Authors: Zhangyi Hu, Bin Yang, Mang Ye

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on three expanded VI-ReID datasets demonstrate that our method significantly improves the retrieval performance, paving the way for the utilization of large foundation models in downstream multi-modal retrieval tasks."
Researcher Affiliation | Academia | "National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China. {zhangyi_hu,yangbin_cv,yemang}@whu.edu.cn"
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | "We will release our data and code soon in https://github.com/WHU-HZY/TVI-LFM"
Open Datasets | Yes | SYSU-MM01 [46]... URL: the dataset can be accessed through a GitHub repository: https://github.com/wuancong/SYSU-MM01... LLCM [65]... URL: the dataset is available on GitHub: https://github.com/ZYK100/LLCM... RegDB [31]... URL: we could only find the paper's DOI: https://doi.org/10.3390/s17030605
Dataset Splits | Yes | "To get stable performance on Tri-SYSU-MM01 and Tri-LLCM, we evaluate our model 10 times with random splits of the gallery set; as for Tri-RegDB, we evaluate our model on 10 trials with different train/test splits and report the average performance on each dataset. The training set contains 22,258 visible images and 11,909 infrared images of 395 identities. The testing set contains 96 identities, with 3,803 infrared images for query and 301 (single-shot) randomly selected visible images as the gallery set." (A hedged sketch of this evaluation protocol follows the table.)
Hardware Specification | Yes | "We implement our framework in PyTorch [32] utilizing a single NVIDIA RTX 3090 GPU for training."
Software Dependencies | No | The paper mentions PyTorch [32], BLIP [24], and Vicuna [67] but does not specify their version numbers, which are required for a reproducible description of software dependencies.
Experiment Setup | Yes | "Each batch consists of 8 identities, with each identity containing 4 visible images, 4 infrared images, 4 text descriptions generated from visible images, and 4 text descriptions generated from infrared images. All input images are resized to 3x288x144, with the same full augmentation strategy as CAJ [57]... We use Adam [17] for optimization. For the Tri-SYSU-MM01 and Tri-LLCM datasets, in both the visual and textual parts, the learning rate is set to 3.5e-4 and the weight decay to 5e-4. For the Tri-RegDB dataset, the learning rate for the visual part is 2e-3 with a weight decay of 5e-4, and for the textual part, the learning rate is 1e-5 with a weight decay of 4e-5. The learning rate rises to its initial value via a linear warm-up scheme over the first 10 epochs, then decays by a factor of 0.1 at the milestones of 40, 60, and 100 epochs." (A hedged sketch of this optimization schedule also follows the table.)
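
To make the Dataset Splits row concrete, below is a minimal NumPy sketch of a repeated single-shot evaluation of the kind the quote describes: over 10 trials, one visible gallery image is drawn per (identity, camera) pair and Rank-1 is averaged. The function names, the cosine-similarity retrieval, the Rank-1-only metric, and the per-(identity, camera) draw are illustrative assumptions, not the authors' released evaluation code.

    import numpy as np

    def rank1(query_feats, query_ids, gallery_feats, gallery_ids):
        """Rank-1 accuracy with cosine similarity on L2-normalized features."""
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
        best = np.argmax(q @ g.T, axis=1)  # index of the top-ranked gallery image
        return float(np.mean(gallery_ids[best] == query_ids))

    def repeated_single_shot_eval(query_feats, query_ids,
                                  gallery_feats, gallery_ids, gallery_cams,
                                  n_trials=10, seed=0):
        """Average Rank-1 over n_trials random single-shot gallery splits,
        drawing one visible image per (identity, camera) pair each trial."""
        rng = np.random.default_rng(seed)
        keys = np.stack([gallery_ids, gallery_cams], axis=1)
        scores = []
        for _ in range(n_trials):
            # Single-shot selection: one random image for each (id, cam) key.
            picks = np.asarray(
                [rng.choice(np.flatnonzero((keys == k).all(axis=1)))
                 for k in np.unique(keys, axis=0)])
            scores.append(rank1(query_feats, query_ids,
                                gallery_feats[picks], gallery_ids[picks]))
        return float(np.mean(scores))

The infrared query set stays fixed across trials; only the visible gallery is resampled, which matches the "random splits of the gallery set" phrasing for Tri-SYSU-MM01 and Tri-LLCM.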
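
Likewise, here is a minimal PyTorch sketch of the optimization schedule quoted in the Experiment Setup row: Adam with per-branch learning rates, a 10-epoch linear warm-up, and x0.1 step decay at the 40/60/100-epoch milestones. The placeholder nn.Linear encoders and the 120-epoch loop are assumptions (the real modules are not yet public), and the learning rates shown follow the Tri-RegDB setting.

    import torch.nn as nn
    from torch.optim import Adam
    from torch.optim.lr_scheduler import LambdaLR

    # Stand-in modules for the visual and textual branches (hypothetical).
    visual_encoder = nn.Linear(512, 512)
    text_encoder = nn.Linear(512, 512)

    # Per-branch base LRs and weight decays as quoted for Tri-RegDB.
    optimizer = Adam([
        {"params": visual_encoder.parameters(), "lr": 2e-3, "weight_decay": 5e-4},
        {"params": text_encoder.parameters(), "lr": 1e-5, "weight_decay": 4e-5},
    ])

    def lr_factor(epoch):
        """Linear warm-up over the first 10 epochs, then x0.1 decay
        at the 40-, 60-, and 100-epoch milestones."""
        if epoch < 10:
            return (epoch + 1) / 10.0  # ramps linearly up to the base LR
        factor = 1.0
        for milestone in (40, 60, 100):
            if epoch >= milestone:
                factor *= 0.1
        return factor

    scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)

    for epoch in range(120):   # illustrative epoch count
        # ... one training epoch over the identity-balanced batches ...
        scheduler.step()       # advance the warm-up / decay schedule

LambdaLR scales each parameter group's base learning rate by the returned factor, so the same warm-up/decay shape applies to both branches while preserving their different base values.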