MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

Authors: Gregory Rogez, Cordelia Schmid

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We address 3D pose estimation in the wild. However, there does not exist a dataset of real-world images with 3D annotations. We thus evaluate our method in two different settings using existing datasets: (1) we validate our 3D pose predictions using Human3.6M [13]... (2) we evaluate on Leeds Sport dataset (LSP) [16]... Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP).
Researcher Affiliation | Academia | Grégory Rogez, Cordelia Schmid, Inria Grenoble Rhône-Alpes, Laboratoire Jean Kuntzmann, France
Pseudocode | No | The paper describes the synthesis engine and CNN architecture but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We use the CMU Motion Capture Dataset and the Human3.6M 3D poses [13], and for 2D pose annotations the MPII-LSP-extended dataset [24] and the Human3.6M 2D poses and images.
Dataset Splits | Yes | We follow the protocol introduced in [18] and employed in [42]: we consider six subjects (S1, S5, S6, S7, S8 and S9) for training, use every 64th frame of subject S11 for testing and evaluate the 3D pose error (mm) averaged over the 13 joints. (A minimal sketch of this evaluation protocol is given after the table.)
Hardware Specification | No | We acknowledge the support of NVIDIA with the donation of the GPUs used for this research. Beyond this acknowledgment of donated NVIDIA GPUs, the paper does not specify GPU models, counts, or any other hardware details.
Software Dependencies | No | The paper mentions using the AlexNet CNN architecture [19] and the VGG-16 architecture [33] but does not provide specific software environment or library version numbers.
Experiment Setup | Yes | We empirically found that K=5000 clusters was a sufficient number of clusters. Given a library of MoCap data and a set of camera views, we synthesize for each 3D pose a 220x220 image. We found that it performed better than just fine-tuning a model pre-trained on ImageNet (3D error of 88.1mm vs 98.3mm with fine-tuning). (A hedged sketch of the pose clustering step also follows the table.)
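
To make the quoted Human3.6M protocol concrete, the following minimal Python sketch selects every 64th frame of subject S11 and computes the 3D pose error in millimetres averaged over the 13 joints. Only the subject split, the frame stride, and the error metric come from the quoted protocol; the function names and the toy data are illustrative assumptions.

import numpy as np

TRAIN_SUBJECTS = ("S1", "S5", "S6", "S7", "S8", "S9")  # training subjects in the quoted protocol
TEST_SUBJECT = "S11"
TEST_FRAME_STRIDE = 64  # every 64th frame of S11 is kept for testing

def select_test_frames(poses_s11):
    # poses_s11: array of shape (num_frames, 13, 3), joint positions in mm
    return poses_s11[::TEST_FRAME_STRIDE]

def mean_joint_error_mm(pred, gt):
    # 3D pose error (mm) averaged over the 13 joints and all test frames
    assert pred.shape == gt.shape and pred.shape[-2:] == (13, 3)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

if __name__ == "__main__":
    # Toy check with synthetic poses standing in for real Human3.6M data.
    rng = np.random.default_rng(0)
    gt = select_test_frames(rng.normal(size=(6400, 13, 3)) * 100.0)
    pred = gt + rng.normal(size=gt.shape) * 10.0
    print(f"mean per-joint error: {mean_joint_error_mm(pred, gt):.1f} mm")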
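
The experiment-setup quote states that K=5000 pose clusters were sufficient but not how the clustering is performed. The sketch below assumes mini-batch K-means over flattened 3D joint coordinates (scikit-learn's MiniBatchKMeans); both the algorithm and the feature choice are assumptions for illustration, not details taken from the paper.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 5000  # number of pose clusters reported as sufficient in the paper

def cluster_pose_library(poses_3d, k=K, seed=0):
    # poses_3d: array of shape (num_poses, num_joints, 3) from the MoCap library.
    # Flattening each pose into one feature vector and using mini-batch K-means
    # are assumptions; the excerpt only states that K=5000 clusters suffice.
    features = poses_3d.reshape(len(poses_3d), -1)
    km = MiniBatchKMeans(n_clusters=k, random_state=seed)
    labels = km.fit_predict(features)
    centroids = km.cluster_centers_.reshape(k, *poses_3d.shape[1:])
    return labels, centroids

Mini-batch K-means is chosen here only to keep clustering tractable at K=5000 on a large pose library; with a smaller library, standard K-means would work identically (and the library must contain at least K poses).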