Diverse Person: Customize Your Own Dataset for Text-Based Person Search
Authors: Zifan Song, Guosheng Hu, Cairong Zhao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that the baseline models trained with our DP can achieve new state-of-the-art results on three public datasets, with performance improvements up to 4.82%, 2.15%, and 2.28% on CUHK-PEDES, ICFG-PEDES, and RSTPReid in terms of Rank-1 accuracy, respectively. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science and Technology, Tongji University, China; (2) Oosto, Belfast, U.K., BT1 2BE |
| Pseudocode | No | The paper describes the proposed method in prose and with figures, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We access three widely used benchmark datasets for text-based person search to evaluate our approaches. CUHK-PEDES (Li et al. 2017) comprises 40,206 images and 80,412 text descriptions of 13,003 persons. ICFG-PEDES (Ding et al. 2021) contains 54,522 images of 4,102 persons collected from the MSMT17 (Wei et al. 2018) database, with each image having a corresponding text description of an average length of 37 words. RSTPReid (Zhu et al. 2021) consists of 4,101 identities, with each identity having 5 corresponding images, resulting in a total of 20,505 images. |
| Dataset Splits | Yes | For CUHK-PEDES, 34,054 images of 11,003 persons and the corresponding 68,108 text descriptions are used as the training set. The remaining 2,000 persons are equally divided into the validation and testing sets, with the validation set comprising 3,078 images and 6,156 text descriptions, and the testing set comprising 3,074 images and 6,148 text descriptions. For the RSTPReid split, the training, validation, and test sets contain 3,701, 200, and 200 identities, respectively (a split-consistency sketch follows the table). |
| Hardware Specification | Yes | During the training of the diffusion models, we use the pre-trained Stable Diffusion v1-5 and conduct model training for 30k steps on 2 NVIDIA RTX 3090 GPUs (a loading sketch follows the table). |
| Software Dependencies | No | The paper mentions software like PyTorch, Stable Diffusion v1-5, CLIP, and BERT, but does not provide specific version numbers for these software dependencies to ensure reproducibility. |
| Experiment Setup | Yes | We set a maximum of 3 reference attributes with a learning rate of 1e-5 and a batch size of 32. The regularization loss is applied to the downsampled cross-attention maps, and the hyperparameter λ is set to 0.001. The hyperparameter γ is set to 0.73 (values in [0.7, 0.75] also achieve optimal results) to balance diversity against consistency with the reference attributes. All images are resized to 384 × 128 before being fed into the image encoders, followed by random horizontal flipping, random crop with padding, and random erasing for data augmentation (a preprocessing sketch follows the table). The text encoder BERT (Devlin et al. 2019) is frozen, and the dimensions of the textual features are 768. The vocabulary size varies across databases, with CUHK-PEDES set to 5,000 and ICFG-PEDES to 3,000. |
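The dataset statistics quoted in the table are internally consistent; the minimal sketch below checks the split arithmetic. All numbers are taken verbatim from the rows above, while the dictionary layout and names are ours.

```python
# Sanity check of the benchmark statistics quoted above. The numbers come
# from the table rows; the dict layout and names are illustrative only.
CUHK_PEDES = {
    "train": {"ids": 11_003, "images": 34_054, "captions": 68_108},
    "val":   {"ids": 1_000,  "images": 3_078,  "captions": 6_156},
    "test":  {"ids": 1_000,  "images": 3_074,  "captions": 6_148},
}
RSTPREID_IDS = {"train": 3_701, "val": 200, "test": 200}

def check_totals() -> None:
    # CUHK-PEDES totals: 13,003 identities, 40,206 images, 80,412 captions.
    assert sum(s["ids"] for s in CUHK_PEDES.values()) == 13_003
    assert sum(s["images"] for s in CUHK_PEDES.values()) == 40_206
    assert sum(s["captions"] for s in CUHK_PEDES.values()) == 80_412
    # RSTPReid: 4,101 identities with 5 images each -> 20,505 images.
    assert sum(RSTPREID_IDS.values()) == 4_101
    assert 5 * sum(RSTPREID_IDS.values()) == 20_505

check_totals()
```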
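For the hardware row, a hedged sketch of how the quoted setup might be reproduced with Hugging Face diffusers follows; the checkpoint identifier, the optimizer choice, and the decision to fine-tune only the UNet are assumptions, since the paper states only Stable Diffusion v1-5, 30k steps, and 2 RTX 3090 GPUs.

```python
# Hedged sketch, not the authors' code: load pre-trained Stable Diffusion
# v1-5 via Hugging Face diffusers and set up an optimizer for fine-tuning.
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint id is an assumption; the paper says only "Stable Diffusion v1-5".
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.to("cuda")

# Assumption: fine-tune only the UNet. The learning rate of 1e-5 is quoted
# in the experiment-setup row; 30k training steps as stated in the paper.
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=1e-5)
MAX_TRAIN_STEPS = 30_000
```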
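The preprocessing described in the experiment-setup row maps onto a standard torchvision pipeline, sketched below. The crop padding size and the erasing probability are not given in the paper, so those values are placeholders.

```python
# Minimal preprocessing sketch based on the experiment-setup row.
# Padding size and erasing probability are assumed, not from the paper.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((384, 128)),                  # "resized to 384 x 128"
    transforms.RandomHorizontalFlip(p=0.5),         # random horizontal flipping
    transforms.RandomCrop((384, 128), padding=10),  # random crop with padding (pad size assumed)
    transforms.ToTensor(),                          # RandomErasing expects a tensor
    transforms.RandomErasing(p=0.5),                # random erasing (probability assumed)
])
```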