Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation
Authors: Jogendra Nath Kundu, Siddharth Seth, Rahul M V, Mugalodi Rakesh, Venkatesh Babu Radhakrishnan, Anirban Chakraborty
AAAI 2020, pp. 11312-11319
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate our state-of-the-art unsupervised and weakly-supervised pose estimation performance on both Human3.6M and MPI-INF-3DHP datasets. Qualitative results on unseen environments further establish our superior generalization ability. |
| Researcher Affiliation | Academia | Jogendra Nath Kundu, Siddharth Seth, Rahul M V, Mugalodi Rakesh, R. Venkatesh Babu, Anirban Chakraborty Indian Institute of Science, Bangalore, India {jogendrak, siddharthseth, venky, anirban}@iisc.ac.in, rmvenkat@andrew.cmu.edu, rakeshramesha@gmail.com |
| Pseudocode | No | The paper provides architectural diagrams and mathematical formulations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not mention the release of source code or provide a link to a code repository. |
| Open Datasets | Yes | Datasets. The base-model is trained on a mixture of two datasets, i.e. Human3.6M and an in-house collection of YouTube videos (also referred as YTube). In contrast to the in-studio H3.6M dataset, YTube contains human subjects in diverse apparel and BG scenes performing varied forms of motion (usually dance forms such as western, modern, contemporary etc.). Note that all samples from H3.6M contribute to the paired dataset D_p, whereas 40% samples in YTube contributed to D_p and rest to D_unp based on the associated BG motion criteria. However, as we do not have ground-truth 3D pose for the samples from YTube (in-the-wild dataset), we use MPI-INF-3DHP (also referred as 3DHP) to quantitatively benchmark generalization of the proposed pose estimation framework. (A minimal sketch of this paired/unpaired split is given after the table.) |
| Dataset Splits | No | The paper describes training on mixed datasets (YTube+H3.6M) and finetuning on H3.6M, and evaluates on standard test protocols. However, it does not provide specific percentages or sample counts for validation splits, nor does it explicitly mention a validation set with detailed split information. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments. |
| Software Dependencies | No | The paper mentions using Resnet-50 as a base pose encoder but does not list specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'). |
| Experiment Setup | Yes | We use Resnet-50 (till res4f) with ImageNet-pretrained parameters as the base pose encoder E_P, whereas the appearance encoder is designed separately using 10 Convolutions. E_P later divides into two parallel branches of fully-connected layers dedicated for v_k and c respectively. We use J = 17 for all our experiments as shown in Fig. 1. The channel-wise aggregation of f_am (16-channels) and f_hm (17-channels) is passed through two convolutional layers to obtain f_2D (128-maps), which is then concatenated with f_a (512-maps) to form the input for D_I (each with 14×14 spatial dimension). Our experiments use different AdaGrad optimizers (learning rate: 0.001) for each individual loss components in alternate training iterations, thereby avoiding any hyper-parameter tuning. (A PyTorch sketch of this wiring and the per-loss optimizers follows the table.) |
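
The data routing quoted in the Open Datasets row (all H3.6M samples paired, roughly 40% of YTube clips paired, the rest unpaired) can be summarized in a few lines. The sketch below is a minimal Python illustration under stated assumptions: the function name `split_ytube_clips`, the clip containers, and the seeded random draw that stands in for the paper's background-motion criterion are all illustrative, not released code.

```python
import random

def split_ytube_clips(ytube_clips, h36m_clips, paired_fraction=0.40, seed=0):
    """Build the paired pool D_p and unpaired pool D_unp described above.

    All H3.6M samples go to D_p.  Roughly `paired_fraction` of the YTube
    clips join D_p and the rest go to D_unp; the paper routes clips by a
    background-motion criterion, which this sketch approximates with a
    seeded random draw.
    """
    rng = random.Random(seed)
    d_p = list(h36m_clips)          # every H3.6M sample is paired
    d_unp = []
    for clip in ytube_clips:
        (d_p if rng.random() < paired_fraction else d_unp).append(clip)
    return d_p, d_unp

# Example usage with placeholder clip identifiers.
if __name__ == "__main__":
    d_p, d_unp = split_ytube_clips([f"yt_{i}" for i in range(1000)],
                                   [f"h36m_{i}" for i in range(500)])
    print(len(d_p), len(d_unp))     # roughly 900 paired, 600 unpaired
```

In the paper the routing is deterministic, driven by how much the background moves in each clip; the random draw here only reproduces the stated 40/60 proportion.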
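The Experiment Setup row pins down several concrete numbers: ResNet-50 up to res4f as E_P, J = 17, a channel-wise aggregation of f_am (16 channels) and f_hm (17 channels) mapped by two convolutions to a 128-map f_2D, concatenation with the 512-map f_a at 14×14 resolution as the input to D_I, and one AdaGrad optimizer per loss term at learning rate 0.001. The PyTorch sketch below wires those pieces together; everything not stated in the quote is an assumption, including the output dimensionalities of v_k and c, the fully-connected branch widths, the appearance encoder and image decoder D_I themselves (omitted), and the mapping of "res4f" to torchvision's `layer3`.

```python
# Minimal PyTorch sketch of the quoted setup; sizes beyond those stated in
# the paper are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torchvision

J = 17  # number of joints used in all experiments


class PoseEncoder(nn.Module):
    """E_P: ImageNet-pretrained ResNet-50 up to res4f, followed by two
    parallel fully-connected branches for v_k and c (sizes assumed)."""

    def __init__(self, vk_dim=3 * J, c_dim=3):  # output dims are assumptions
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # conv1 .. layer3 corresponds to "till res4f"
        # (1024 channels, 14x14 spatial size for a 224x224 input).
        self.trunk = nn.Sequential(*list(backbone.children())[:-3])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc_vk = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                   nn.Linear(512, vk_dim))
        self.fc_c = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, c_dim))

    def forward(self, img):
        feat = self.pool(self.trunk(img)).flatten(1)
        return self.fc_vk(feat), self.fc_c(feat)


class SpatialFusion(nn.Module):
    """Aggregate f_am (16 ch) and f_hm (17 ch) channel-wise, apply two convs
    to obtain f_2D (128 maps), then concatenate with f_a (512 maps) to form
    the 640-channel, 14x14 input of the image decoder D_I."""

    def __init__(self):
        super().__init__()
        self.to_f2d = nn.Sequential(
            nn.Conv2d(16 + 17, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )

    def forward(self, f_am, f_hm, f_a):
        f_2d = self.to_f2d(torch.cat([f_am, f_hm], dim=1))
        return torch.cat([f_2d, f_a], dim=1)  # (B, 640, 14, 14)


def build_optimizers(params_for_loss):
    """One AdaGrad optimizer (lr = 0.001) per loss term.

    `params_for_loss` maps each loss name to the parameters it updates
    (an assumed bookkeeping structure, not from the paper)."""
    return {name: torch.optim.Adagrad(params, lr=1e-3)
            for name, params in params_for_loss.items()}
```

Stepping only one of these optimizers per training iteration, in rotation over the loss terms, matches the quoted "alternate training iterations" scheme and is what lets the setup avoid tuning loss-weighting hyper-parameters.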