PLIP: Language-Image Pre-training for Person Representation Learning
Authors: Jialong Zuo, Jiahao Hong, Feng Zhang, Changqian Yu, Hanyu Zhou, Changxin Gao, Nong Sang, Jingdong Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train PLIP on SYNTH-PEDES and evaluate our models across a range of downstream person-centric tasks. PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings. |
| Researcher Affiliation | Collaboration | Jialong Zuo¹, Jiahao Hong¹, Feng Zhang¹, Changqian Yu², Hanyu Zhou¹, Changxin Gao¹, Nong Sang¹, Jingdong Wang³. ¹National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; ²Skywork AI; ³Department of Computer Vision, Baidu Inc. |
| Pseudocode | Yes | Algorithm 1 Seed Filter Strategy. 1: Input S_in = {ID_k}_{k=1}^N, where ID_k = {x_i^k, ..., x_n^k}; 2: for ID_k ∈ S_in do 3: repeat 4: if min Sim(ID_k) < σ_s then 5: x_j^k = arg min Sim(ID_k); 6: ID_k = ID_k \ x_j^k, x_j^k → S_exclude; 7: end if 8: until min Sim(ID_k) ≥ σ_s 9: end for 10: for x_j^k ∈ S_exclude do 11: if max{Sim(x_j^k, ID_i)}_{i=k-r_a}^{k+r_a} ≥ σ_r then 12: ID_r = arg max{Sim(x_j^k, ID_i)}_{i=k-r_a}^{k+r_a}; 13: ID_r = ID_r + x_j^k; 14: end if 15: end for 16: for ID_k ∈ {ID_i}_{i=1}^N do 17: if {Sim(ID_k, ID_{k+i})}_{i=1}^{r_b} > σ_m then 18: {ID_k, ID_{k+i}} → S_merge 19: end if 20: end for 21: for FID_i ∈ S_merge do 22: for FID_j ∈ {FID_j}_{j=i+1}^{i+r_b} do 23: if FID_i ∩ FID_j ≠ ∅ then 24: FID_i = Merge(FID_i, FID_j) 25: end if 26: end for 27: FID_i → S_out 28: end for 29: Output S_out. (A Python sketch of this filtering strategy follows the table.) |
| Open Source Code | Yes | Project Link: https://github.com/Zplusdragon/PLIP |
| Open Datasets | Yes | Therefore, we construct a new large-scale person dataset with image-text pairs named SYNTH-PEDES based on the LUPerson-NL and LPW datasets [30, 83]. |
| Dataset Splits | Yes | The training set has 34,054 images of 11,003 identities. The validation and test set have 3,078 and 3,074 images of 1,000 identities, respectively. |
| Hardware Specification | Yes | We train our model on 4 GeForce 3090 GPUs for 70 epochs. |
| Software Dependencies | No | The pre-trained BERT [23] is utilized as the textual encoder with the last 5 layers unfrozen. |
| Experiment Setup | Yes | During the training of PLIP, we adopt four types of backbone as the visual encoder, i.e., ResNet50, ResNet101, ResNet152 and Swin Transformer Base. The pre-trained BERT [23] is utilized as the textual encoder and we only unfreeze the last 5 layers, keeping all other parameters frozen. All images are resized to 256 × 128 and normalized with a mean and std of [0.357, 0.323, 0.328] and [0.252, 0.242, 0.239], which are calculated from our proposed SYNTH-PEDES. We adopt horizontal flipping to augment data, where each image is flipped with 50% probability. For ResNet50, we train our model on 4 GeForce 3090 GPUs for 70 epochs with a total batch size of 512, which takes approximately 15.2 days. The base learning rate is set to 0.002 and decayed by a factor of 0.1 at epochs 30 and 50. Besides, a learning rate warm-up strategy is adopted in the first 10 epochs, and the learning rate of BERT is scaled by 0.1. For the other types of visual encoder, there are some differences in the learning rate and batch size settings. The hyper-parameters in the objective function are set to λ1 = 0.02 and λ2 = 0.1. The optimizer is Adan [100] with the default settings, and we adopt mixed-precision training via Apex. (A configuration sketch based on these preprocessing and schedule details follows the table.) |
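
The Pseudocode row quotes Algorithm 1 (Seed Filter Strategy) from the paper. Below is a minimal Python sketch of that procedure, not the authors' implementation: the similarity callables `sim_image_to_id` / `sim_id_to_id`, the thresholds `sigma_s`, `sigma_r`, `sigma_m`, and the ranges `r_a`, `r_b` are placeholders standing in for the paper's similarity measure and hyper-parameters, and the final merge step uses a simple union-find to approximate merging identity groups with non-empty intersections.

```python
# Sketch of the quoted Seed Filter Strategy (Algorithm 1); names and
# similarity functions are placeholders, not the released PLIP code.

def seed_filter(ids, sim_image_to_id, sim_id_to_id,
                sigma_s, sigma_r, sigma_m, r_a, r_b):
    ids = [list(images) for images in ids]         # ids[k] = image list of ID_k
    excluded = []                                   # S_exclude: (k, image) pairs

    # Stage 1 (lines 2-9): drop low-similarity images from each identity.
    for k, id_k in enumerate(ids):
        while len(id_k) > 1:
            scores = [sim_image_to_id(x, id_k) for x in id_k]
            if min(scores) >= sigma_s:
                break
            excluded.append((k, id_k.pop(scores.index(min(scores)))))

    # Stage 2 (lines 10-15): re-assign an excluded image to a nearby identity
    # if it is similar enough to one of them.
    for k, x in excluded:
        lo, hi = max(0, k - r_a), min(len(ids), k + r_a + 1)
        scores = [sim_image_to_id(x, ids[i]) for i in range(lo, hi)]
        if max(scores) >= sigma_r:
            ids[lo + scores.index(max(scores))].append(x)

    # Stage 3 (lines 16-20): collect neighbouring identity pairs that look
    # like duplicates of the same person.
    merge_pairs = []
    for k in range(len(ids)):
        for i in range(1, r_b + 1):
            if k + i < len(ids) and sim_id_to_id(ids[k], ids[k + i]) > sigma_m:
                merge_pairs.append((k, k + i))

    # Stage 4 (lines 21-28): merge overlapping pairs via union-find and emit S_out.
    parent = list(range(len(ids)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in merge_pairs:
        parent[find(b)] = find(a)
    merged = {}
    for k, id_k in enumerate(ids):
        merged.setdefault(find(k), []).extend(id_k)
    return list(merged.values())                    # S_out
```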
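
The Experiment Setup row specifies 256 × 128 inputs, SYNTH-PEDES mean/std normalization, 50% random horizontal flipping, a base learning rate of 0.002 with a 10-epoch warm-up and ×0.1 decay at epochs 30 and 50, and a 0.1-scaled learning rate for BERT. The snippet below is a hedged sketch of that configuration using torchvision; the linear shape of the warm-up, the helper names, and the parameter grouping are assumptions rather than details taken from the released code.

```python
from torchvision import transforms

# 256x128 resize, 50% random horizontal flip, SYNTH-PEDES mean/std (from the setup row).
train_transform = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.357, 0.323, 0.328],
                         std=[0.252, 0.242, 0.239]),
])

def lr_at_epoch(epoch, base_lr=0.002, warmup_epochs=10, milestones=(30, 50)):
    """Base LR 0.002, warm-up over the first 10 epochs (assumed linear),
    then x0.1 decay at epochs 30 and 50."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)

def build_param_groups(visual_params, bert_params, base_lr=0.002):
    """Unfrozen BERT layers run at a 10x smaller learning rate than the
    visual encoder; the paper's Adan optimizer would consume these groups."""
    return [
        {"params": visual_params, "lr": base_lr},
        {"params": bert_params, "lr": base_lr * 0.1},
    ]
```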