CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels
Authors: Siyuan Li, Li Sun, Qingli Li
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks. We evaluate our method on four person re-identification datasets, including MSMT17 (Wei et al. 2018), Market-1501 (Zheng et al. 2015), DukeMTMC-reID (Ristani et al. 2016), Occluded-Duke (Miao et al. 2019), and two vehicle ReID datasets, VeRi-776 (Liu et al. 2016b) and VehicleID (Liu et al. 2016a). |
| Researcher Affiliation | Academia | ¹Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University; ²Key Laboratory of Advanced Theory and Application in Statistics and Data Science, East China Normal University |
| Pseudocode | Yes | Algorithm 1: CLIP-ReID's training process. |
| Open Source Code | Yes | Code is available at https://github.com/Syliz517/CLIP-ReID. |
| Open Datasets | Yes | We evaluate our method on four person re-identification datasets, including MSMT17 (Wei et al. 2018), Market-1501 (Zheng et al. 2015), DukeMTMC-reID (Ristani et al. 2016), Occluded-Duke (Miao et al. 2019), and two vehicle ReID datasets, VeRi-776 (Liu et al. 2016b) and VehicleID (Liu et al. 2016a). |
| Dataset Splits | No | Following common practices, we adapt the cumulative matching characteristics (CMC) at Rank-1 (R1) and the mean average precision (mAP) to evaluate the performance. The paper reports training details and evaluation metrics but does not explicitly specify validation splits or how the data is partitioned for validation. (A hedged sketch of these metrics is given after the table.) |
| Hardware Specification | No | No specific hardware details like GPU/CPU models, memory, or processing units were mentioned for the experimental setup. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were explicitly stated. |
| Experiment Setup | Yes | In the first training stage, we use the Adam optimizer for both the CNN-based and the ViT-based models, with a learning rate initialized at 3.5 × 10⁻⁴ and decayed by a cosine schedule. At this stage, the batch size is set to 64 without using any augmentation methods. Only the learnable text tokens [X]1[X]2[X]3...[X]M are optimizable. In the second training stage (same as our baseline), the Adam optimizer is also used to train the image encoder. Each minibatch consists of B = P × K images, where P is the number of randomly selected identities and K is the number of samples per identity. We take P = 16 and K = 4. Each image is augmented by random horizontal flipping, padding, cropping and erasing (Zhong et al. 2020). For the CNN-based model, we spend 10 epochs linearly increasing the learning rate from 3.5 × 10⁻⁶ to 3.5 × 10⁻⁴, and then the learning rate is decayed by 0.1 at the 40th and 70th epochs. For the ViT-based model, we warm up the model for 10 epochs with a linearly growing learning rate from 5 × 10⁻⁷ to 5 × 10⁻⁶. Then, it is decreased by a factor of 0.1 at the 30th and 50th epochs. We train the CNN-based model for 120 epochs and the ViT-based model for 60 epochs. For the CNN-based model, we use L_tri and L_id pre and post the global attention pooling layer, and α is set to 0.3. Similarly, we use them pre and post the linear layer after the transformer. Note that we also employ L_tri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50. (A sketch of this two-stage schedule follows the table.) |
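
The two-stage optimization schedule quoted in the Experiment Setup row can be summarized in a minimal sketch. The snippet below assumes PyTorch; the function names `stage1_optimizer` and `stage2_optimizer` are illustrative and not taken from the official CLIP-ReID repository. It only encodes the learning-rate values, warmup, and decay milestones reported in the paper.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR

def stage1_optimizer(text_tokens, num_epochs):
    """Stage 1: only the learnable text tokens [X]_1..[X]_M are trained,
    Adam with lr = 3.5e-4 decayed by a cosine schedule."""
    opt = Adam([text_tokens], lr=3.5e-4)
    sched = CosineAnnealingLR(opt, T_max=num_epochs)
    return opt, sched

def stage2_optimizer(image_encoder, backbone="cnn"):
    """Stage 2: fine-tune the image encoder with a 10-epoch linear warmup
    followed by step decay (x0.1) at the milestones reported in the paper."""
    if backbone == "cnn":   # ResNet-50: 3.5e-6 -> 3.5e-4, decay at epochs 40/70, 120 epochs
        base_lr, start_lr, milestones, epochs = 3.5e-4, 3.5e-6, (40, 70), 120
    else:                   # ViT-B/16: 5e-7 -> 5e-6, decay at epochs 30/50, 60 epochs
        base_lr, start_lr, milestones, epochs = 5e-6, 5e-7, (30, 50), 60

    def lr_lambda(epoch):
        if epoch < 10:  # linear warmup over the first 10 epochs
            frac = epoch / 10
            return (start_lr / base_lr) * (1 - frac) + frac
        return 0.1 ** sum(epoch >= m for m in milestones)

    opt = Adam(image_encoder.parameters(), lr=base_lr)
    return opt, LambdaLR(opt, lr_lambda), epochs
```

In stage 2, each minibatch would additionally be drawn with a PK sampler (P = 16 identities, K = 4 images each) and the augmentations listed above, which the sketch leaves out for brevity.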
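The evaluation protocol referenced in the Dataset Splits row uses CMC at Rank-1 and mAP. The sketch below, assuming NumPy and a precomputed query-to-gallery distance matrix, shows one common way these metrics are computed for ReID; the helper name `evaluate_rank1_map` is hypothetical, and the usual filtering of same-camera gallery samples is omitted for brevity.

```python
import numpy as np

def evaluate_rank1_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery) distance matrix;
    q_ids / g_ids: identity labels of query and gallery images."""
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                        # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:                             # no true match in the gallery
            continue
        rank1_hits.append(matches[0])                      # CMC at Rank-1
        cum_hits = np.cumsum(matches)                      # average precision over the ranked list
        precision = cum_hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```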