AXM-Net: Implicit Cross-Modal Feature Alignment for Person Re-identification
Authors: Ammarah Farooq, Muhammad Awais, Josef Kittler, Syed Safwan Khalid
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The entire AXM-Net is trained end-to-end on CUHK-PEDES data. We report results on two tasks, person search and cross-modal Re-ID. The AXM-Net outperforms the current state-of-the-art (SOTA) methods and achieves 64.44% Rank@1 on the CUHK-PEDES test set. It also outperforms its competitors by >10% in cross-viewpoint text-to-image Re-ID scenarios on Cross Re-ID and CUHK-SYSU datasets. |
| Researcher Affiliation | Collaboration | Ammarah Farooq1, Muhammad Awais1,2,3, Josef Kittler1,2,3, Syed Safwan Khalid1 1Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, 2Surrey Institute for People-centred AI (SI-PAI), 3Sensus Futuris Ltd. {ammarah.farooq, m.a.rana, j.kittler, s.khalid}@surrey.ac.uk |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its source code. |
| Open Datasets | Yes | The entire AXM-Net is trained end-to-end on CUHK-PEDES data. CUHK-PEDES: The CUHK person description data (Li et al. 2017b) is the only large-scale benchmark available for cross-modal person search. Cross Re-ID Dataset: For cross-modal Re-ID, we evaluate the models on the protocol introduced by (Farooq et al. 2020b) on the test split of CUHK-PEDES data. CUHK-SYSU: We evaluate our model on the test protocol provided by (Farooq et al. 2020a). |
| Dataset Splits | Yes | CUHK-PEDES: It has 13,003 person IDs with 40,206 images and 80,440 descriptions. There are 11,003, 1,000 and 1,000 pre-defined IDs for the training, validation and test sets, respectively. The training and test sets include 34,054/68,126 and 3,074/6,156 images/descriptions, respectively. Cross Re-ID Dataset: The dataset includes 824 unique IDs. There are 1,511/3,022 and 1,096/2,200 images/descriptions in the gallery and query sets, respectively. CUHK-SYSU: There are 5,532 IDs for training and 2,099 IDs for testing. The corresponding descriptions have been extracted from CUHK-PEDES data. The final gallery and query splits contain 5,070/10,140 and 3,271/6,550 images/descriptions, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using BERT embedding (Devlin et al. 2018) and word2vec (Mikolov et al. 2013) embedding, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We follow a two-stage training strategy to train the AXM-Net. For the first stage, the training follows the standard classification paradigm, considering each person as an individual class and only using the joint identity loss L_ID^joint. We also apply label smoothing to our cross-entropy loss. We use batch size 64, weight decay 5e-4 and initial learning rate 0.01 with stochastic gradient descent optimisation. Images are resized to 384×128. Each textual description is mapped to a 768-dimensional BERT embedding (Devlin et al. 2018) and resized as 1×56×768, where 56 is the maximum sentence length. We kept the word embedding layer fixed during training. We adopted random flipping and random erasing for images, and random circular shift of sentences as data augmentation. We used an equal contribution (λ) for each loss and a margin (α) equal to 0.5. During inference, the vision and text features are extracted separately, as the weights of the unified feature learning backbone are independent of data samples. We use the sum of both vision features (V_G + V_P) and the text feature for evaluation. |
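
The Experiment Setup row lists concrete stage-one training hyperparameters. The sketch below is a minimal PyTorch rendering of that configuration, not the authors' implementation (no code is released): `AXMNetStub` and the synthetic batch are placeholders, and the SGD momentum and label-smoothing value are assumptions since the paper does not state them. Only batch size 64, lr 0.01, weight decay 5e-4, 384×128 images, the 1×56×768 frozen BERT text input, and the listed augmentations come from the paper.

```python
# Minimal sketch of the stage-one (identity classification) training setup.
import torch
import torch.nn as nn
from torchvision import transforms

class AXMNetStub(nn.Module):
    """Stand-in for the unreleased AXM-Net; maps an image/text pair to ID logits."""
    def __init__(self, num_ids: int = 11003, dim: int = 256):
        super().__init__()
        self.visual = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, dim))
        self.textual = nn.Sequential(nn.Flatten(), nn.Linear(56 * 768, dim))
        self.classifier = nn.Linear(dim, num_ids)

    def forward(self, images, text_emb):
        fused = self.visual(images) + self.textual(text_emb)  # crude fusion, illustration only
        return self.classifier(fused)

# Image augmentation as reported: resize to 384x128, random flipping, random erasing.
# (Defined for reference; the synthetic batch below skips it.)
image_transform = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(),
])

model = AXMNetStub()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)            # smoothing value assumed
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            weight_decay=5e-4, momentum=0.9)    # momentum assumed

# One synthetic batch standing in for a CUHK-PEDES DataLoader (batch size 64).
images = torch.randn(64, 3, 384, 128)     # resized, augmented person images
text_emb = torch.randn(64, 1, 56, 768)    # frozen BERT word embeddings, max length 56
labels = torch.randint(0, 11003, (64,))   # person-ID targets (11,003 training IDs)

logits = model(images, text_emb)          # stage one uses only the joint ID loss
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The second training stage, which adds the paper's remaining losses with equal weights λ and margin α = 0.5, is not shown here; only the classification stage described verbatim in the row above is sketched.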