MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild
Authors: Grégory Rogez, Cordelia Schmid
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address 3D pose estimation in the wild. However, there does not exist a dataset of real-world images with 3D annotations. We thus evaluate our method in two different settings using existing datasets: (1) we validate our 3D pose predictions using Human3.6M [13]... (2) we evaluate on the Leeds Sport dataset (LSP) [16]... Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP). |
| Researcher Affiliation | Academia | Grégory Rogez, Cordelia Schmid, Inria Grenoble Rhône-Alpes, Laboratoire Jean Kuntzmann, France |
| Pseudocode | No | The paper describes the synthesis engine and CNN architecture but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We use the CMU Motion Capture Dataset and the Human3.6M 3D poses [13], and for 2D pose annotations the MPII-LSP-extended dataset [24] and the Human3.6M 2D poses and images. |
| Dataset Splits | Yes | We follow the protocol introduced in [18] and employed in [42]: we consider six subjects (S1, S5, S6, S7, S8 and S9) for training, use every 64th frame of subject S11 for testing and evaluate the 3D pose error (mm) averaged over the 13 joints. (A minimal sketch of this split and metric appears below the table.) |
| Hardware Specification | No | We acknowledge the support of NVIDIA with the donation of the GPUs used for this research. |
| Software Dependencies | No | The paper mentions using the AlexNet CNN architecture [19] and the VGG-16 architecture [33] but does not provide specific software environment or library version numbers. |
| Experiment Setup | Yes | We empirically found that K=5000 was a sufficient number of clusters. Given a library of MoCap data and a set of camera views, we synthesize for each 3D pose a 220x220 image. We found that it performed better than just fine-tuning a model pre-trained on ImageNet (3D error of 88.1mm vs 98.3mm with fine-tuning). (A sketch of the pose clustering appears below the table.) |
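
The split-and-evaluate protocol quoted in the Dataset Splits row is concrete enough to sketch. Below is a minimal, hypothetical Python illustration: the subject lists, the every-64th-frame subsampling, and the 13-joint mean error come from the quote, while the array layout and the random stand-in data are assumptions for illustration, not the authors' code.

```python
import numpy as np

# Protocol from the quoted row: six training subjects, every 64th frame
# of S11 for testing, mean 3D joint error (mm) over 13 joints.
TRAIN_SUBJECTS = ["S1", "S5", "S6", "S7", "S8", "S9"]
TEST_SUBJECT = "S11"
N_JOINTS = 13

def test_frames(poses_s11):
    """Subsample subject S11: keep every 64th frame."""
    return poses_s11[::64]

def mean_joint_error_mm(pred, gt):
    """Per-joint Euclidean distance, averaged over joints and frames (mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy usage with random stand-in arrays (real data would come from Human3.6M).
rng = np.random.default_rng(0)
gt = rng.normal(size=(1000, N_JOINTS, 3)) * 100.0  # (frames, joints, xyz) in mm
pred = gt + rng.normal(size=gt.shape) * 10.0       # pretend predictions
print(mean_joint_error_mm(test_frames(pred), test_frames(gt)))
```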
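The Experiment Setup row reports that pose space is quantized into K=5000 clusters, which the CNN then treats as classification targets. Here is a minimal sketch of that quantization step, assuming scikit-learn's `MiniBatchKMeans`; the paper does not name a clustering implementation, so this library choice and the function names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 5000  # number of pose clusters reported as sufficient in the paper

def cluster_poses(poses_3d, k=K):
    """Quantize 3D poses into k clusters; the cluster index of each pose
    can then serve as a per-image classification target.

    poses_3d: (n_samples, n_joints, 3) array of aligned 3D poses.
    """
    flat = poses_3d.reshape(len(poses_3d), -1)  # one flattened row per pose
    km = MiniBatchKMeans(n_clusters=k, random_state=0).fit(flat)
    return km.labels_, km.cluster_centers_
```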