Embodied Scene-aware Human Pose Estimation
Authors: Zhengyi Luo, Shun Iwase, Ye Yuan, Kris Kitani
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate, we use the popular H36M and PROX datasets and achieve high quality pose estimation on the challenging PROX dataset without ever using PROX motion sequences for training. Code and videos are available on the project page. (Abstract) |
| Researcher Affiliation | Academia | 1 Carnegie Mellon University |
| Pseudocode | Yes | ALGORITHM 1: Learning embodied pose estimator via dynamics-regulated training. |
| Open Source Code | Yes | Code and videos are available on the project page. (Abstract) and Code included in supplemental material (Page 11, Section 3a) |
| Open Datasets | Yes | We use the AMASS [22], H36M [10], and Kin_poly [21] datasets to dynamically generate 2D keypoints and 3D pose pairs for training. |
| Dataset Splits | Yes | Our Universal Humanoid Controller is trained on the AMASS dataset training split that contains high-quality SMPL parameters of 11402 sequences curated from the various MoCap datasets. For training our pose estimator, we use motions from the AMASS training split, the Kin_poly dataset, and H36M dataset. ... For evaluation, we use videos from the PROX [7] dataset and the test split of H36M. |
| Hardware Specification | Yes | The training process of our embodied pose estimator takes around 2 days on an RTX 3090 with 30 CPU threads. During inference, our network is causal and runs at 10 FPS on an Intel desktop CPU. |
| Software Dependencies | No | We use the MuJoCo [40] physics simulator (Section 4, Implementation details). No specific version numbers for software dependencies are provided. |
| Experiment Setup | Yes | The training process of our embodied pose estimator takes around 2 days on an RTX 3090 with 30 CPU threads. During inference, our network is causal and runs at 10 FPS... During training, we initialize the agent with the ground-truth 3D location and pose. The H36M and Kin_poly datasets contain simple human-scene interactions... we also randomly sample scenes from the PROX dataset and pair them with motions from the AMASS dataset. ... For each motion sequence, our loss function is defined as the L1 distance between the predicted kinematic pose and the ground truth plus a prior loss (Sections 3.3 and 4); see the sketch after the table. |
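The per-sequence loss quoted in the Experiment Setup row can be expressed compactly. The following is a minimal PyTorch-style sketch, not the authors' released code: the `prior_loss_fn` callable and the `prior_weight` factor are placeholders, since the excerpt only states "L1 distance between the predicted kinematic pose and the ground truth plus a prior loss" without specifying the prior term or its weighting.

```python
import torch
import torch.nn.functional as F

def pose_estimation_loss(pred_pose, gt_pose, prior_loss_fn, prior_weight=1.0):
    """Sketch of the per-sequence training loss described in the excerpt:
    an L1 term between predicted and ground-truth kinematic pose, plus a
    prior loss.

    pred_pose, gt_pose: tensors of shape (T, D) for one motion sequence.
    prior_loss_fn:      hypothetical callable returning the prior term;
                        its exact form is not given in the quoted text.
    prior_weight:       assumed weighting factor (not specified in the paper excerpt).
    """
    l1_term = F.l1_loss(pred_pose, gt_pose)    # L1 reconstruction term
    prior_term = prior_loss_fn(pred_pose)      # regularizing prior term
    return l1_term + prior_weight * prior_term
```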