CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels

Authors: Siyuan Li, Li Sun, Qingli Li

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks. We evaluate our method on four person re-identification datasets, including MSMT17 (Wei et al. 2018), Market-1501 (Zheng et al. 2015), DukeMTMC-reID (Ristani et al. 2016), and Occluded-Duke (Miao et al. 2019), and two vehicle ReID datasets, VeRi-776 (Liu et al. 2016b) and VehicleID (Liu et al. 2016a).
Researcher Affiliation | Academia | 1. Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University; 2. Key Laboratory of Advanced Theory and Application in Statistics and Data Science, East China Normal University
Pseudocode | Yes | Algorithm 1: CLIP-ReID's training process.
Open Source Code | Yes | Code is available at https://github.com/Syliz517/CLIP-ReID.
Open Datasets | Yes | We evaluate our method on four person re-identification datasets, including MSMT17 (Wei et al. 2018), Market-1501 (Zheng et al. 2015), DukeMTMC-reID (Ristani et al. 2016), and Occluded-Duke (Miao et al. 2019), and two vehicle ReID datasets, VeRi-776 (Liu et al. 2016b) and VehicleID (Liu et al. 2016a).
Dataset Splits | No | Following common practices, we adopt the cumulative matching characteristics (CMC) at Rank-1 (R1) and the mean average precision (mAP) to evaluate the performance. The paper mentions training details and evaluation metrics but does not explicitly specify validation dataset splits or how data is partitioned for validation. (A generic sketch of how CMC Rank-1 and mAP are typically computed appears after the table.)
Hardware Specification | No | No specific hardware details such as GPU/CPU models, memory, or processing units were mentioned for the experimental setup.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were explicitly stated.
Experiment Setup | Yes | In the first training stage, we use the Adam optimizer for both the CNN-based and the ViT-based models, with a learning rate initialized at 3.5×10⁻⁴ and decayed by a cosine schedule. At this stage, the batch size is set to 64 without using any augmentation methods. Only the learnable text tokens [X]_1[X]_2[X]_3...[X]_M are optimizable. In the second training stage (same as our baseline), the Adam optimizer is also used to train the image encoder. Each minibatch consists of B = P × K images, where P is the number of randomly selected identities and K is the number of samples per identity. We take P = 16 and K = 4. Each image is augmented by random horizontal flipping, padding, cropping and erasing (Zhong et al. 2020). For the CNN-based model, we spend 10 epochs linearly increasing the learning rate from 3.5×10⁻⁶ to 3.5×10⁻⁴, and then the learning rate is decayed by 0.1 at the 40th and 70th epochs. For the ViT-based model, we warm up the model for 10 epochs with a linearly growing learning rate from 5×10⁻⁷ to 5×10⁻⁶. Then, it is decreased by a factor of 0.1 at the 30th and 50th epochs. We train the CNN-based model for 120 epochs and the ViT-based model for 60 epochs. For the CNN-based model, we use L_tri and L_id before and after the global attention pooling layer, and α is set to 0.3. Similarly, for the ViT-based model, we use them before and after the linear layer following the transformer. Note that we also employ L_tri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50.
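
The two-stage optimization quoted in the Experiment Setup row can be summarized in code. The following is a minimal PyTorch-style sketch of the optimizer and learning-rate settings for the ViT branch; the stand-in modules (prompt_tokens, image_encoder), the feature dimensions, and the stage-1 epoch count are placeholder assumptions for illustration, not the authors' implementation (their repository contains the actual code).

    import torch
    import torch.nn as nn

    # Placeholder components -- stand-ins, not the authors' actual modules.
    M, D = 4, 512
    prompt_tokens = nn.Parameter(torch.randn(M, D))   # learnable text tokens [X]_1 ... [X]_M
    image_encoder = nn.Linear(D, D)                   # stands in for the CLIP ViT-B/16 image encoder

    # Stage 1: only the text tokens are optimized; Adam, LR 3.5e-4 with cosine decay.
    stage1_epochs = 60                                # placeholder; the quoted text does not give this count
    opt1 = torch.optim.Adam([prompt_tokens], lr=3.5e-4)
    sched1 = torch.optim.lr_scheduler.CosineAnnealingLR(opt1, T_max=stage1_epochs)

    # Stage 2 (ViT branch): Adam on the image encoder, 60 epochs total.
    # Linear warm-up from 5e-7 to 5e-6 over the first 10 epochs,
    # then the LR is multiplied by 0.1 at epochs 30 and 50.
    base_lr = 5e-6
    opt2 = torch.optim.Adam(image_encoder.parameters(), lr=base_lr)

    def stage2_lr_factor(epoch: int) -> float:
        if epoch < 10:                                # warm-up phase
            return (5e-7 + (base_lr - 5e-7) * epoch / 10) / base_lr
        factor = 1.0
        if epoch >= 30:
            factor *= 0.1
        if epoch >= 50:
            factor *= 0.1
        return factor

    sched2 = torch.optim.lr_scheduler.LambdaLR(opt2, lr_lambda=stage2_lr_factor)

    # Stage-2 minibatches follow a P x K identity sampler: 16 identities, 4 images each.
    P, K = 16, 4
    batch_size = P * K                                # 64 images per batch

Stepping sched1/sched2 once per epoch reproduces the quoted schedule; the losses (L_id, L_tri) and the actual CLIP encoders are omitted here.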
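
As referenced in the Dataset Splits row, evaluation uses CMC Rank-1 and mAP. Below is a generic NumPy sketch of how these retrieval metrics are commonly computed for ReID, assuming L2-normalized query/gallery features and identity/camera labels; it follows the usual same-ID/same-camera filtering convention and is not the authors' evaluation code.

    import numpy as np

    def evaluate_rank1_map(qf, gf, q_ids, g_ids, q_cams, g_cams):
        """Generic sketch of ReID evaluation (CMC Rank-1 and mAP).

        qf, gf: L2-normalized query/gallery features, shapes (Q, D) and (G, D).
        q_ids, g_ids: identity labels; q_cams, g_cams: camera labels.
        """
        dist = 1.0 - qf @ gf.T                 # cosine distance between queries and gallery
        indices = np.argsort(dist, axis=1)     # gallery sorted by distance for each query

        rank1_hits, aps = [], []
        for i in range(qf.shape[0]):
            order = indices[i]
            # Standard protocol: drop gallery images with the same ID *and* same camera.
            keep = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
            matches = (g_ids[order][keep] == q_ids[i]).astype(np.float32)
            if matches.sum() == 0:             # query has no valid gallery match
                continue
            rank1_hits.append(matches[0])      # CMC@1: is the top-ranked item correct?
            # Average precision over the ranked list.
            cum_hits = np.cumsum(matches)
            precision = cum_hits / (np.arange(len(matches)) + 1)
            aps.append((precision * matches).sum() / matches.sum())

        return float(np.mean(rank1_hits)), float(np.mean(aps))

Rank-1 asks whether the top-ranked gallery image shares the query's identity; mAP averages precision over all correct matches in each ranked list and then over queries.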