Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification
Authors: Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, Xi Peng
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both interactive Re-ID and text-based Re-ID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines. As shown in Table 1, LLaVA-ReID achieves superior performance; in particular, it improves R@1 by 28.1% and 37.34% after 3 and 5 rounds of interaction (compared to Initial), respectively. |
| Researcher Affiliation | Academia | ¹College of Computer Science, Sichuan University, China; ²National Key Laboratory of Fundamental Algorithms and Models for Engineering Numerical Simulation, Sichuan University, China. Correspondence to: Yijie Lin <EMAIL>, Xi Peng <EMAIL>. |
| Pseudocode | No | The paper describes methods textually and with equations and figures (e.g., Figure 2 illustrates the data construction pipeline and Figure 3 shows the architecture of the selector), but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps formatted like code. |
| Open Source Code | Yes | https://github.com/XLearning-SCU/LLaVA-ReID |
| Open Datasets | Yes | To facilitate the study of this novel task, we construct a new dataset, Interactive-PEDES, which incorporates: i) coarse-grained descriptions to simulate the initial, partial queries provided by witnesses, ii) fine-grained descriptions that capture rich, detailed visual characteristics, simulating the witness's latent memories, and iii) multi-round dialogues derived by decomposing fine-grained descriptions into diverse questions, addressing detailed attributes of individuals. ... The dataset comprises 54,749 images of 13,051 individuals, collected from CUHK-PEDES (Li et al., 2017) and ICFG-PEDES (Ding et al., 2021). |
| Dataset Splits | Yes | The training set comprises 47,376 images corresponding to 11,543 identities, while the test set includes 7,373 images representing 1,508 identities. Additional details are provided in Appendix A. |
| Hardware Specification | Yes | All experiments are conducted on an Ubuntu 20.04 system with NVIDIA 4090 GPUs. |
| Software Dependencies | No | The paper mentions several frameworks and models used (CLIP, IRRA, LLaVA-OneVision-Qwen2-7B-ov, QLoRA, Qwen2.5-7B-Instruct) and the operating system (Ubuntu 20.04). However, it does not provide specific version numbers for software libraries or packages such as PyTorch or TensorFlow, nor programming language versions. |
| Experiment Setup | Yes | For the Retriever, we adopt CLIP (Radford et al., 2021) as the backbone and train it using the IRRA (Jiang & Ye, 2023) framework with fine-grained descriptions from Interactive-PEDES for 30 epochs with a batch size of 128. All other training parameters follow the original IRRA settings. ... For the Questioner, we build our model on LLaVA-OneVision-Qwen2-7B-ov (Li et al., 2024a) and fine-tune it using QLoRA (Dettmers et al., 2023). The model is quantized to 4-bit and LoRA weights are applied with r = 128 and α = 256. The learning rate is set to 1 × 10⁻⁵, with a batch size of 4 and gradient accumulation steps of 4. ... During the interaction process, we limit the question length to 96 tokens and the answer length to 40 tokens. Table 11 provides further training configuration details for LLaVA-ReID, including lora alpha: 256, lora r: 128, lora dropout: 0.05, epoch: 1, batch size: 4, learning rate: 1e-5, etc. |
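The reported Interactive-PEDES split sizes can be cross-checked against the dataset totals quoted in the Open Datasets row. A minimal sketch (the `splits_consistent` helper is illustrative, not from the paper):

```python
# Statistics reported for Interactive-PEDES in the table above.
total = {"images": 54_749, "identities": 13_051}
train = {"images": 47_376, "identities": 11_543}
test = {"images": 7_373, "identities": 1_508}

def splits_consistent(total, train, test):
    """Check that train + test counts sum to the reported totals."""
    return all(train[k] + test[k] == total[k] for k in total)

print(splits_consistent(total, train, test))  # True
```

Both image counts (47,376 + 7,373 = 54,749) and identity counts (11,543 + 1,508 = 13,051) add up, so the splits partition the full dataset with no overlap in the totals.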
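The QLoRA hyperparameters in the Experiment Setup row imply two derived quantities worth noting: LoRA scales its low-rank update by α/r, and gradient accumulation multiplies the per-step batch into a larger effective batch. A plain-dict sketch of the reported values (this is not the authors' training script; in practice these would map onto a LoRA/QLoRA config object in a fine-tuning framework):

```python
# Hyperparameters reported for fine-tuning the Questioner (Table 11 of the paper).
config = {
    "lora_r": 128,
    "lora_alpha": 256,
    "lora_dropout": 0.05,
    "learning_rate": 1e-5,
    "batch_size": 4,
    "grad_accum_steps": 4,
    "epochs": 1,
}

# LoRA applies its low-rank update scaled by alpha / r.
lora_scaling = config["lora_alpha"] / config["lora_r"]

# Effective batch size seen by each optimizer step.
effective_batch = config["batch_size"] * config["grad_accum_steps"]

print(lora_scaling, effective_batch)  # 2.0 16
```

So the reported setup uses a LoRA scaling factor of 2.0 and an effective batch size of 16 per optimizer update.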