ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Authors: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art, i.e., 80.9 AP on the MS COCO test-dev set. Comprehensive experiments on popular benchmarks are conducted to study and analyze the capabilities of ViTPose.
Researcher Affiliation | Collaboration | Yufei Xu (1), Jing Zhang (1), Qiming Zhang (1), Dacheng Tao (2,1); (1) School of Computer Science, The University of Sydney, Australia; (2) JD Explore Academy, China
Pseudocode | No | The paper describes the model architecture and processes using text and diagrams (Fig. 2), but does not include any formal pseudocode blocks or algorithms.
Open Source Code | Yes | The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.
Open Datasets | Yes | Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark... We use MAE [15] to pre-train the backbones with MS COCO [28] and a combination of MS COCO and AI Challenger [41] respectively... Specifically, we use MS COCO [28], AI Challenger [41], and MPII [3] datasets for multi-dataset training.
Dataset Splits | Yes | The detection results from Simple Baseline [42] are utilized for evaluating ViTPose's performance on the MS COCO Keypoint val set.
Hardware Specification | Yes | The models are trained on 8 A100 GPUs based on the mmpose codebase [11].
Software Dependencies | No | The paper mentions using the 'mmpose codebase [11]' and the 'AdamW [33] optimizer', but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | The default training setting in mmpose is utilized for training the ViTPose models, i.e., we use the 256×192 input resolution and AdamW [33] optimizer with a learning rate of 5e-4. UDP [18] is used for post-processing. The models are trained for 210 epochs with the learning rate decayed by a factor of 10 at the 170th and 200th epochs. We sweep the layer-wise learning rate decay [46] and stochastic drop path rate for each model, and the optimal settings are provided in Table 1.
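
For concreteness, here is a minimal PyTorch sketch of the schedule quoted in the Experiment Setup row: AdamW at a base learning rate of 5e-4, 210 training epochs, and a step decay by a factor of 10 at epochs 170 and 200 on 256×192 inputs. This is a hedged illustration, not the actual mmpose pipeline; the stub model, dummy data, and dummy loss are placeholders, and the mmpose-specific pieces (UDP post-processing, layer-wise learning-rate decay, the drop path sweep) are deliberately omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

# Hypothetical stub standing in for the ViTPose backbone + decoder that
# mmpose would build from its config system (17 = number of COCO keypoints).
model = torch.nn.Conv2d(3, 17, kernel_size=1)

# AdamW with the paper's base learning rate of 5e-4.
optimizer = AdamW(model.parameters(), lr=5e-4)

# Learning rate decayed by a factor of 10 at the 170th and 200th epochs.
scheduler = MultiStepLR(optimizer, milestones=[170, 200], gamma=0.1)

for epoch in range(210):
    # Real training would iterate over 256x192 person crops with heatmap
    # targets; a dummy forward/backward pass keeps the sketch runnable.
    x = torch.randn(1, 3, 256, 192)
    loss = model(x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the step-decay schedule once per epoch
```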
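
The multi-dataset training noted in the Open Datasets row can likewise be illustrated with a small sketch: pooling MS COCO, AI Challenger, and MPII samples into one training stream via `torch.utils.data.ConcatDataset`. The `KeypointDataset` class and sample counts below are hypothetical stand-ins; the actual implementation lives in the mmpose config system and must reconcile the differing keypoint definitions across the three benchmarks, which this sketch glosses over.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class KeypointDataset(Dataset):
    """Hypothetical stand-in for a COCO/AIC/MPII-style keypoint dataset.

    Each item is a (256x192 crop, heatmap target) pair; real datasets would
    map their native keypoint definitions onto a shared target space.
    """
    def __init__(self, num_samples: int):
        self.num_samples = num_samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        image = torch.randn(3, 256, 192)    # person crop
        heatmaps = torch.randn(17, 64, 48)  # per-keypoint target heatmaps
        return image, heatmaps

# Pool the three benchmarks into a single training stream.
combined = ConcatDataset([
    KeypointDataset(num_samples=100),  # stands in for MS COCO
    KeypointDataset(num_samples=100),  # stands in for AI Challenger
    KeypointDataset(num_samples=100),  # stands in for MPII
])
loader = DataLoader(combined, batch_size=64, shuffle=True)
```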