Referring Human Pose and Mask Estimation In the Wild
Authors: Bo Miao, Mingtao Feng, Zijie Wu, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that UniPHD produces quality results based on user-friendly prompts and achieves top-tier performance on RefHuman val and MS COCO val2017. |
| Researcher Affiliation | Academia | 1University of Western Australia 2Xidian University 3Hunan University 4Griffith University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks, or clearly labeled algorithm sections. |
| Open Source Code | Yes | https://github.com/bo-miao/RefHuman |
| Open Datasets | Yes | We substantially extend COCO [40] to construct the RefHuman dataset. It contains pose and mask annotations for humans along with text and positional prompts to facilitate the new task of R-HPM. |
| Dataset Splits | Yes | To construct the RefHuman train set, we annotate prompts for all humans in the MS COCO train2017 set with at least three surrounding people, a minimum of eight visible keypoints, and an area ratio of at least 2%. For the RefHuman val set, we annotate humans in the MS COCO val2017 set, excluding those with non-visible keypoints or an area ratio below 1%, as instances below this threshold are often not visually clear and difficult to describe accurately. |
| Hardware Specification | Yes | FPS is measured on RTX 3090 with a batch size of 24. |
| Software Dependencies | No | The paper mentions software components like RoBERTa and Swin Transformer but does not specify exact version numbers for general software dependencies or libraries (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We use the AdamW [48] optimizer with a weight decay of 1×10⁻⁴ and train our models on 24GB RTX 3090 GPUs with batch size 16 for 20 epochs. The initial learning rates are set to 1×10⁻⁵ for the visual encoder and 1×10⁻⁴ for other components, with a rate decay at the 18th epoch by a factor of 10. |
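
The Dataset Splits row above describes instance-level filtering of MS COCO keypoint annotations. The snippet below is a minimal sketch of how such a val-set filter could be reproduced with pycocotools; the thresholds follow the quoted description, while the use of `num_keypoints` as a visibility proxy and the annotation file path are assumptions, not the authors' actual selection code.

```python
# Sketch of a RefHuman-style val filter over COCO person annotations.
# Assumes standard COCO keypoint annotations and pycocotools.
from pycocotools.coco import COCO

def select_val_candidates(ann_file="annotations/person_keypoints_val2017.json",
                          min_area_ratio=0.01):
    coco = COCO(ann_file)
    keep = []
    for img_id in coco.getImgIds():
        img = coco.loadImgs(img_id)[0]
        img_area = img["height"] * img["width"]
        for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False)):
            # 'num_keypoints' counts labeled keypoints; we treat it as a
            # proxy for keypoint visibility (an assumption on our part).
            if ann["num_keypoints"] == 0:
                continue
            # Drop instances occupying less than 1% of the image area.
            if ann["area"] / img_area < min_area_ratio:
                continue
            keep.append(ann["id"])
    return keep
```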
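
The Experiment Setup row specifies the optimizer and schedule in enough detail for a straightforward PyTorch reconstruction. The sketch below mirrors those hyperparameters (AdamW, weight decay 1×10⁻⁴, lr 1×10⁻⁵ for the visual encoder and 1×10⁻⁴ elsewhere, decay by 10× at epoch 18); the `visual_encoder` attribute name and the model object are placeholders, not the released implementation.

```python
# Minimal PyTorch sketch of the reported optimizer/schedule configuration.
import torch

def build_optimizer_and_scheduler(model):
    visual_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed module name: parameters under "visual_encoder" get the lower lr.
        (visual_params if name.startswith("visual_encoder") else other_params).append(param)

    optimizer = torch.optim.AdamW(
        [
            {"params": visual_params, "lr": 1e-5},
            {"params": other_params, "lr": 1e-4},
        ],
        weight_decay=1e-4,
    )
    # Decay both learning rates by a factor of 10 at epoch 18 (of 20 total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[18], gamma=0.1)
    return optimizer, scheduler
```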