HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

Authors: Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and an on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation. (Section 4, Experiments)
Researcher Affiliation | Collaboration | Junkun Yuan (1,2), Xinyu Zhang (2), Hao Zhou (2), Jian Wang (2), Zhongwei Qiu (3), Zhiyin Shao (4), Shaofeng Zhang (5), Sifan Long (6), Kun Kuang (1), Kun Yao (2), Junyu Han (2), Errui Ding (2), Lanfen Lin (1), Fei Wu (1), Jingdong Wang (2). Affiliations: 1 Zhejiang University; 2 Baidu VIS; 3 University of Science and Technology Beijing; 4 South China University of Technology; 5 Shanghai Jiao Tong University; 6 Jilin University.
Pseudocode | No | The paper describes the HAP framework and its components (e.g., Figure 2) but does not provide pseudocode or a clearly labeled algorithm block (a hedged sketch of the masking step is given after this table).
Open Source Code | Yes | Code: https://github.com/junkunyuan/HAP
Open Datasets | Yes | LUPerson [23] is a large-scale person dataset consisting of about 4.2M images of over 200K persons across different environments. Following [51], the subset of LUPerson with 2.1M images is used for pre-training.
Dataset Splits | Yes | HAP is evaluated on 12 benchmarks across 5 human-centric perception tasks: person ReID on Market-1501 [87] and MSMT17 [74]; 2D pose estimation on MPII [1], COCO [46], and AIC [75]; text-to-image person ReID on CUHK-PEDES [42], ICFG-PEDES [20], and RSTPReid [89]; 3D pose and shape estimation on 3DPW [67]; and pedestrian attribute recognition on PA-100K [47], RAP [39], and PETA [18].
Hardware Specification | No | The paper describes training settings such as batch size and epochs, but does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software components such as AdamW [49], BERT [19], and Bi-LSTM [27], and uses ViTPose [78] for keypoint extraction, but it does not specify version numbers for these or other software dependencies, such as the Python or PyTorch version used.
Experiment Setup | Yes | The input image resolution is set to 256×128 and the batch size to 4096. The encoder of HAP is based on ViT-Base [21]. HAP adopts AdamW [49] as the optimizer with weight decay set to 0.05, uses a cosine-decay learning-rate schedule [48] with a base learning rate of 1.5e-4, and sets the warmup epochs to 40 and the total epochs to 400 (Table 5: Hyper-parameters of pre-training; a sketch of this recipe also follows the table).
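Since the paper provides no pseudocode, the following is a minimal sketch of what a structure-aware masking step could look like, assuming an MAE-style random mask that is biased toward patches containing human keypoints (which the paper extracts with ViTPose [78]). The function name, the bias factor, and the 75% mask ratio are illustrative assumptions, not the paper's exact rule.

```python
import torch

def structure_biased_mask(num_patches: int,
                          keypoint_patch_ids: torch.Tensor,
                          mask_ratio: float = 0.75,
                          kp_bias: float = 2.0) -> torch.Tensor:
    """Return a boolean mask over patches, biased toward keypoint patches.

    ASSUMPTIONS: the 0.75 mask ratio, the multiplicative bias, and the
    top-k selection are illustrative; the paper does not spell out its rule.
    """
    scores = torch.rand(num_patches)
    scores[keypoint_patch_ids] *= kp_bias  # keypoint patches score higher -> masked first
    num_masked = int(mask_ratio * num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[scores.topk(num_masked).indices] = True
    return mask

# Example: a 256x128 image with 16x16 patches gives a 16x8 grid = 128 patches.
mask = structure_biased_mask(128, keypoint_patch_ids=torch.tensor([10, 37, 64]))
print(mask.sum().item())  # 96 patches masked at the 0.75 ratio
```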
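The pre-training hyper-parameters in the Experiment Setup row are concrete enough to reconstruct the optimization recipe. Below is a minimal PyTorch sketch; the MAE-style linear base-lr × batch/256 scaling and the placeholder model are assumptions not stated in the paper's Table 5.

```python
import math
import torch

# Hyper-parameters reported in Table 5 of the paper.
BASE_LR = 1.5e-4
WEIGHT_DECAY = 0.05
BATCH_SIZE = 4096
WARMUP_EPOCHS = 40
TOTAL_EPOCHS = 400

# ASSUMPTION: MAE-style linear scaling of the base lr with batch size;
# the paper states the base lr but not the scaling rule.
lr_peak = BASE_LR * BATCH_SIZE / 256

def lr_at(epoch: float) -> float:
    """Cosine decay with linear warmup, matching the schedule named above."""
    if epoch < WARMUP_EPOCHS:
        return lr_peak * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return lr_peak * 0.5 * (1.0 + math.cos(math.pi * progress))

# Placeholder for the ViT-Base encoder (hypothetical stand-in, not HAP's model).
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_peak,
                              weight_decay=WEIGHT_DECAY)

for epoch in range(TOTAL_EPOCHS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... one epoch of masked-image-modeling on 256x128 crops, batch size 4096 ...
```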