HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
Authors: Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and an on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation. (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | Junkun Yuan (1,2), Xinyu Zhang (2), Hao Zhou (2), Jian Wang (2), Zhongwei Qiu (3), Zhiyin Shao (4), Shaofeng Zhang (5), Sifan Long (6), Kun Kuang (1), Kun Yao (2), Junyu Han (2), Errui Ding (2), Lanfen Lin (1), Fei Wu (1), Jingdong Wang (2); 1: Zhejiang University, 2: Baidu VIS, 3: University of Science and Technology Beijing, 4: South China University of Technology, 5: Shanghai Jiao Tong University, 6: Jilin University |
| Pseudocode | No | The paper describes the HAP framework and its components (e.g., Figure 2) but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code: https://github.com/junkunyuan/HAP |
| Open Datasets | Yes | LUPerson [23] is a large-scale person dataset, consisting of about 4.2M images of over 200K persons across different environments. Following [51], we use the subset of LUPerson with 2.1M images for pre-training. |
| Dataset Splits | Yes | We evaluate our HAP on 12 benchmarks across 5 human-centric perception tasks, including person ReID on Market-1501 [87] and MSMT17 [74], 2D pose estimation on MPII [1], COCO [46] and AIC [75], text-to-image person ReID on CUHK-PEDES [42], ICFG-PEDES [20] and RSTPReid [89], 3D pose and shape estimation on 3DPW [67], and pedestrian attribute recognition on PA-100K [47], RAP [39] and PETA [18]. |
| Hardware Specification | No | The paper describes training settings such as batch size and epochs, but does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as AdamW [49], BERT [19], and Bi-LSTM [27], and uses ViTPose [78] for keypoint extraction, but it does not specify version numbers for these or other software dependencies, such as the Python or PyTorch version used. |
| Experiment Setup | Yes | The resolution of the input image is set to 256×128 and the batch size is set to 4096. The encoder model structure of HAP is based on the ViT-Base [21]. HAP adopts AdamW [49] as the optimizer in which the weight decay is set to 0.05. We use a cosine decay learning rate schedule [48], and the base learning rate is set to 1.5e-4. The warmup epochs are set to 40 and the total epochs are set to 400. (Table 5: Hyper-parameters of pre-training; a hedged configuration sketch follows this table.) |
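
For concreteness, here is a minimal PyTorch sketch of the pre-training schedule described in the Experiment Setup row. Only the hyper-parameter values (batch size, base LR, weight decay, warmup and total epochs) come from the paper; the placeholder model, the linear LR scaling by batch size (an MAE-style convention, not stated in the quoted setup), and the per-epoch scheduler stepping are assumptions for illustration.

```python
import math
import torch

# Pre-training hyper-parameters reported in the paper (Table 5).
IMG_SIZE = (256, 128)   # input resolution (H, W)
BATCH_SIZE = 4096
BASE_LR = 1.5e-4
WEIGHT_DECAY = 0.05
WARMUP_EPOCHS = 40
TOTAL_EPOCHS = 400

# Placeholder module; the paper's encoder is a ViT-Base.
model = torch.nn.Linear(768, 768)

# Assumption: effective LR scales linearly with batch size
# (the common MAE convention), i.e. lr = base_lr * batch / 256.
lr = BASE_LR * BATCH_SIZE / 256

optimizer = torch.optim.AdamW(
    model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY
)

def lr_lambda(epoch: int) -> float:
    """Linear warmup for WARMUP_EPOCHS, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(TOTAL_EPOCHS):
    # ... one pre-training epoch over the LUPerson 2.1M-image subset ...
    scheduler.step()
```

This sketch only reproduces the optimization schedule; the structure-aware masking and the masked-image-modeling objective themselves are not shown here, since the paper provides no pseudocode for them (see the Pseudocode row above) and the released code at https://github.com/junkunyuan/HAP is the authoritative reference.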