Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation
Authors: Jogendra Nath Kundu, Siddharth Seth, Rahul M V, Mugalodi Rakesh, Venkatesh Babu Radhakrishnan, Anirban Chakraborty
AAAI 2020, pp. 11312-11319
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate our state-of-the-art unsupervised and weakly-supervised pose estimation performance on both Human3.6M and MPI-INF-3DHP datasets. Qualitative results on unseen environments further establish our superior generalization ability. |
| Researcher Affiliation | Academia | Jogendra Nath Kundu, Siddharth Seth, Rahul M V, Mugalodi Rakesh, R. Venkatesh Babu, Anirban Chakraborty Indian Institute of Science, Bangalore, India {jogendrak, siddharthseth, venky, anirban}@iisc.ac.in, rmvenkat@andrew.cmu.edu, rakeshramesha@gmail.com |
| Pseudocode | No | The paper provides architectural diagrams and mathematical formulations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not mention the release of source code or provide a link to a code repository. |
| Open Datasets | Yes | Datasets. The base-model is trained on a mixture of two datasets, i.e. Human3.6M and an in-house collection of YouTube videos (also referred as YTube). In contrast to the in-studio H3.6M dataset, YTube contains human subjects in diverse apparel and BG scenes performing varied forms of motion (usually dance forms such as western, modern, contemporary etc.). Note that all samples from H3.6M contribute to the paired dataset D_p, whereas 40% samples in YTube contributed to D_p and rest to D_unp based on the associated BG motion criteria. However, as we do not have ground-truth 3D pose for the samples from YTube (in-the-wild dataset), we use MPI-INF-3DHP (also referred as 3DHP) to quantitatively benchmark generalization of the proposed pose estimation framework. (A minimal sketch of this paired/unpaired split is given after the table.) |
| Dataset Splits | No | The paper describes training on mixed datasets (YTube+H3.6M) and finetuning on H3.6M, and evaluates on standard test protocols. However, it does not provide specific percentages or sample counts for validation splits, nor does it explicitly mention a validation set with detailed split information. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments. |
| Software Dependencies | No | The paper mentions using Resnet-50 as a base pose encoder but does not list specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'). |
| Experiment Setup | Yes | We use Resnet-50 (till res4f) with ImageNet-pretrained parameters as the base pose encoder E_P, whereas the appearance encoder is designed separately using 10 Convolutions. E_P later divides into two parallel branches of fully-connected layers dedicated for v_k and c respectively. We use J = 17 for all our experiments as shown in Fig. 1. The channel-wise aggregation of f_am (16-channels) and f_hm (17-channels) is passed through two convolutional layers to obtain f_2D (128-maps), which is then concatenated with f_a (512-maps) to form the input for D_I (each with 14×14 spatial dimension). Our experiments use different AdaGrad optimizers (learning rate: 0.001) for each individual loss components in alternate training iterations, thereby avoiding any hyper-parameter tuning. (A PyTorch sketch of this wiring and the per-loss optimizers follows the table.) |
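
The data routing quoted in the Open Datasets row (all H3.6M samples paired, roughly 40% of YTube clips paired, the rest unpaired) can be summarized in a few lines. The sketch below is a minimal Python illustration under stated assumptions: the function name `split_ytube_clips`, the clip containers, and the seeded random draw that stands in for the paper's background-motion criterion are all illustrative, not released code.

```python
import random

def split_ytube_clips(ytube_clips, h36m_clips, paired_fraction=0.40, seed=0):
    """Build the paired pool D_p and unpaired pool D_unp described above.

    All H3.6M samples go to D_p.  Roughly `paired_fraction` of the YTube
    clips join D_p and the rest go to D_unp; the paper routes clips by a
    background-motion criterion, which this sketch approximates with a
    seeded random draw.
    """
    rng = random.Random(seed)
    d_p = list(h36m_clips)          # every H3.6M sample is paired
    d_unp = []
    for clip in ytube_clips:
        (d_p if rng.random() < paired_fraction else d_unp).append(clip)
    return d_p, d_unp

# Example usage with placeholder clip identifiers.
if __name__ == "__main__":
    d_p, d_unp = split_ytube_clips([f"yt_{i}" for i in range(1000)],
                                   [f"h36m_{i}" for i in range(500)])
    print(len(d_p), len(d_unp))     # roughly 900 paired, 600 unpaired
```

In the paper the routing is deterministic, driven by how much the background moves in each clip; the random draw here only reproduces the stated 40/60 proportion.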
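The Experiment Setup row pins down several concrete numbers: ResNet-50 up to res4f as E_P, J = 17, a channel-wise aggregation of f_am (16 channels) and f_hm (17 channels) mapped by two convolutions to a 128-map f_2D, concatenation with the 512-map f_a at 14×14 resolution as the input to D_I, and one AdaGrad optimizer per loss term at learning rate 0.001. The PyTorch sketch below wires those pieces together; everything not stated in the quote is an assumption, including the output dimensionalities of v_k and c, the fully-connected branch widths, the appearance encoder and image decoder D_I themselves (omitted), and the mapping of "res4f" to torchvision's `layer3`.

```python
# Minimal PyTorch sketch of the quoted setup; sizes beyond those stated in
# the paper are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torchvision

J = 17  # number of joints used in all experiments


class PoseEncoder(nn.Module):
    """E_P: ImageNet-pretrained ResNet-50 up to res4f, followed by two
    parallel fully-connected branches for v_k and c (sizes assumed)."""

    def __init__(self, vk_dim=3 * J, c_dim=3):  # output dims are assumptions
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # conv1 .. layer3 corresponds to "till res4f"
        # (1024 channels, 14x14 spatial size for a 224x224 input).
        self.trunk = nn.Sequential(*list(backbone.children())[:-3])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc_vk = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                   nn.Linear(512, vk_dim))
        self.fc_c = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, c_dim))

    def forward(self, img):
        feat = self.pool(self.trunk(img)).flatten(1)
        return self.fc_vk(feat), self.fc_c(feat)


class SpatialFusion(nn.Module):
    """Aggregate f_am (16 ch) and f_hm (17 ch) channel-wise, apply two convs
    to obtain f_2D (128 maps), then concatenate with f_a (512 maps) to form
    the 640-channel, 14x14 input of the image decoder D_I."""

    def __init__(self):
        super().__init__()
        self.to_f2d = nn.Sequential(
            nn.Conv2d(16 + 17, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )

    def forward(self, f_am, f_hm, f_a):
        f_2d = self.to_f2d(torch.cat([f_am, f_hm], dim=1))
        return torch.cat([f_2d, f_a], dim=1)  # (B, 640, 14, 14)


def build_optimizers(params_for_loss):
    """One AdaGrad optimizer (lr = 0.001) per loss term.

    `params_for_loss` maps each loss name to the parameters it updates
    (an assumed bookkeeping structure, not from the paper)."""
    return {name: torch.optim.Adagrad(params, lr=1e-3)
            for name, params in params_for_loss.items()}
```

Stepping only one of these optimizers per training iteration, in rotation over the loss terms, matches the quoted "alternate training iterations" scheme and is what lets the setup avoid tuning loss-weighting hyper-parameters.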