ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Authors: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art, i.e., 80.9 AP on the MS COCO test-dev set. Comprehensive experiments on popular benchmarks are conducted to study and analyze the capabilities of ViTPose. |
| Researcher Affiliation | Collaboration | Yufei Xu (1), Jing Zhang (1), Qiming Zhang (1), Dacheng Tao (2,1); (1) School of Computer Science, The University of Sydney, Australia; (2) JD Explore Academy, China |
| Pseudocode | No | The paper describes the model architecture and processes using text and diagrams (Fig. 2), but does not include any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code and models are available at https://github.com/ViTAE-Transformer/ViTPose. |
| Open Datasets | Yes | Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark... We use MAE [15] to pre-train the backbones with MS COCO [28] and a combination of MS COCO and AI Challenger [41] respectively... Specifically, we use MS COCO [28], AI Challenger [41], and MPII [3] datasets for multi-dataset training. |
| Dataset Splits | Yes | The detection results from Simple Baseline [42] are utilized for evaluating ViTPose's performance on the MS COCO Keypoint val set. |
| Hardware Specification | Yes | The models are trained on 8 A100 GPUs based on the mmpose codebase [11]. |
| Software Dependencies | No | The paper mentions using the 'mmpose codebase [11]' and the 'AdamW [33] optimizer', but does not provide specific version numbers for these or other software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The default training setting in mmpose is utilized for training the ViTPose models, i.e., we use the 256×192 input resolution and AdamW [33] optimizer with a learning rate of 5e-4. UDP [18] is used for post-processing. The models are trained for 210 epochs with the learning rate decayed by a factor of 10 at the 170th and 200th epochs. We sweep the layer-wise learning rate decay [46] and stochastic drop path rate for each model, and the optimal settings are provided in Table 1. (A hedged config sketch reconstructing these settings follows this table.) |
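
To make the reported hyperparameters concrete, below is a minimal mmpose-style config sketch assembled from the setup described above. It assumes mmpose 0.x config conventions; the key names (`paramwise_cfg`, `use_udp`, etc.) and the placeholder layer-decay value are illustrative assumptions, not the authors' released configuration (which is available in the linked repository).

```python
# Hypothetical mmpose-style training config sketching the reported setup.
# Key names follow mmpose 0.x conventions; values marked as placeholders
# are assumptions, not the paper's reported settings.

optimizer = dict(
    type='AdamW',
    lr=5e-4,  # reported base learning rate
    # Layer-wise lr decay and stochastic drop path rate are swept per model
    # (Table 1 of the paper); 0.75 is a placeholder, not a reported value.
    paramwise_cfg=dict(layer_decay_rate=0.75),
)

# Train for 210 epochs, decaying the learning rate by 10x at epochs 170 and 200.
lr_config = dict(policy='step', step=[170, 200])
total_epochs = 210

# 256x192 input resolution; mmpose's data_cfg expects [width, height].
data_cfg = dict(image_size=[192, 256])

# UDP (unbiased data processing) for post-processing, per the paper.
test_cfg = dict(use_udp=True)
```

Since mmpose configs are plain Python files, this sketch could be dropped into the codebase's config directory and filled in with the per-model values from Table 1; the layer-wise decay and drop path rate would need to be set separately for each ViTPose variant.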