ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Authors: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art, i.e., 80.9 AP on the MS COCO test-dev set. Comprehensive experiments on popular benchmarks are conducted to study and analyze the capabilities of ViTPose.
Researcher Affiliation | Collaboration | Yufei Xu (1), Jing Zhang (1), Qiming Zhang (1), Dacheng Tao (2,1); (1) School of Computer Science, The University of Sydney, Australia; (2) JD Explore Academy, China
Pseudocode | No | The paper describes the model architecture and processes using text and diagrams (Fig. 2), but does not include any formal pseudocode blocks or algorithms.
Open Source Code | Yes | The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.
Open Datasets | Yes | Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark... We use MAE [15] to pre-train the backbones with MS COCO [28] and a combination of MS COCO and AI Challenger [41] respectively... Specifically, we use MS COCO [28], AI Challenger [41], and MPII [3] datasets for multi-dataset training.
Dataset Splits | Yes | The detection results from Simple Baseline [42] are utilized for evaluating ViTPose's performance on the MS COCO Keypoint val set.
Hardware Specification | Yes | The models are trained on 8 A100 GPUs based on the mmpose codebase [11].
Software Dependencies | No | The paper mentions using the 'mmpose codebase [11]' and the 'AdamW [33] optimizer', but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | The default training setting in mmpose is utilized for training the ViTPose models, i.e., we use the 256×192 input resolution and AdamW [33] optimizer with a learning rate of 5e-4. UDP [18] is used for post-processing. The models are trained for 210 epochs with the learning rate decayed by a factor of 10 at the 170th and 200th epochs. We sweep the layer-wise learning rate decay [46] and stochastic drop path rate for each model, and the optimal settings are provided in Table 1.
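
For concreteness, here is a minimal PyTorch sketch of the schedule quoted in the Experiment Setup row: AdamW at a base learning rate of 5e-4, 210 training epochs, and a step decay by a factor of 10 at epochs 170 and 200 on 256×192 inputs. This is a hedged illustration, not the actual mmpose pipeline; the stub model, dummy data, and dummy loss are placeholders, and the mmpose-specific pieces (UDP post-processing, layer-wise learning-rate decay, the drop path sweep) are deliberately omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

# Hypothetical stub standing in for the ViTPose backbone + decoder that
# mmpose would build from its config system (17 = number of COCO keypoints).
model = torch.nn.Conv2d(3, 17, kernel_size=1)

# AdamW with the paper's base learning rate of 5e-4.
optimizer = AdamW(model.parameters(), lr=5e-4)

# Learning rate decayed by a factor of 10 at the 170th and 200th epochs.
scheduler = MultiStepLR(optimizer, milestones=[170, 200], gamma=0.1)

for epoch in range(210):
    # Real training would iterate over 256x192 person crops with heatmap
    # targets; a dummy forward/backward pass keeps the sketch runnable.
    x = torch.randn(1, 3, 256, 192)
    loss = model(x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the step-decay schedule once per epoch
```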
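
The multi-dataset training noted in the Open Datasets row can likewise be illustrated with a small sketch: pooling MS COCO, AI Challenger, and MPII samples into one training stream via `torch.utils.data.ConcatDataset`. The `KeypointDataset` class and sample counts below are hypothetical stand-ins; the actual implementation lives in the mmpose config system and must reconcile the differing keypoint definitions across the three benchmarks, which this sketch glosses over.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class KeypointDataset(Dataset):
    """Hypothetical stand-in for a COCO/AIC/MPII-style keypoint dataset.

    Each item is a (256x192 crop, heatmap target) pair; real datasets would
    map their native keypoint definitions onto a shared target space.
    """
    def __init__(self, num_samples: int):
        self.num_samples = num_samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        image = torch.randn(3, 256, 192)    # person crop
        heatmaps = torch.randn(17, 64, 48)  # per-keypoint target heatmaps
        return image, heatmaps

# Pool the three benchmarks into a single training stream.
combined = ConcatDataset([
    KeypointDataset(num_samples=100),  # stands in for MS COCO
    KeypointDataset(num_samples=100),  # stands in for AI Challenger
    KeypointDataset(num_samples=100),  # stands in for MPII
])
loader = DataLoader(combined, batch_size=64, shuffle=True)
```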