Progressive Multi-View Human Mesh Recovery with Self-Supervision

Authors: Xuan Gong, Liangchen Song, Meng Zheng, Benjamin Planche, Terrence Chen, Junsong Yuan, David Doermann, Ziyan Wu

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive benchmarking, we demonstrate the superiority of the proposed solution especially for unseen in-the-wild scenarios. Empirical evaluations show that our method achieves very competitive results on H3.6M (Ionescu et al. 2013), Total Capture (Trumble et al. 2017) and the challenging Ski-Pose (Spörri 2016; Rhodin et al. 2018) dataset compared with other fully/weakly supervised multi-view human mesh recovery and human pose estimation methods. Our key contributions can be summarized as: [...] 3) We conduct extensive experiments on standard benchmark datasets and demonstrate comparable numbers with fully/weakly supervised methods on conventional evaluation metrics.
Researcher Affiliation | Collaboration | Xuan Gong (1,2), Liangchen Song (1,2), Meng Zheng (1), Benjamin Planche (1), Terrence Chen (1), Junsong Yuan (2), David Doermann (2), Ziyan Wu (1). 1: United Imaging Intelligence, Cambridge MA 02140 USA; 2: University at Buffalo, Buffalo NY 14260 USA. xuangong@buffalo.edu, lsong8@buffalo.edu, meng.zheng@uii-ai.com, benjamin.planche@uii-ai.com, terrence.chen@uii-ai.com, jsyuan@buffalo.edu, doermann@buffalo.edu, ziyan.wu@uii-ai.com
Pseudocode | No | The paper includes figures illustrating the pipeline, but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | Training data: Following the protocol by Sengupta, Budvytis, and Cipolla (2020), we sample SMPL pose parameters from the training sets of UP-3D (Lassner et al. 2017), 3DPW (von Marcard et al. 2018), and the five training subjects (S1, S5, S6, S7, S8) of Human3.6M (Ionescu et al. 2013). (A data-assembly sketch follows the table.)
Dataset Splits | Yes | Human3.6M: The Human3.6M dataset (Ionescu et al. 2013) provides a total of 3.6 million frames from four synchronized views. The camera placement is slightly different for each of the seven subjects. We follow the most popular protocol 1, testing on subjects S9 and S11. Total Capture: The Total Capture dataset (Trumble et al. 2017) consists of 1.9 million frames, captured from 8 calibrated full-HD video cameras recording at 60 Hz. Following the typical data split (Trumble et al. 2017), we use Walking2, Freestyle3, and Acting3 on subjects 1, 2, 3, 4, 5 for testing. (These splits are collected into a configuration sketch after the table.)
Hardware Specification | Yes | It takes ~3 days on one A100 GPU.
Software Dependencies | No | The paper mentions using the Adam optimizer, Keypoint-RCNN, and DensePose-RCNN, but does not specify version numbers for these software components or any programming languages/libraries (e.g., PyTorch, TensorFlow versions). (A hedged dependency sketch follows the table.)
Experiment Setup | Yes | Training is done using the Adam (Kingma and Ba 2014) optimizer for 6 epochs with a learning rate of 1e-4 and a batch size of 16. For consistency with training, we crop both the masks and 2D joint heatmaps with a scale of 1.2 before forwarding them to the network for 3D mesh inference. (A minimal training-loop sketch with these hyper-parameters follows the table.)
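
The training-data protocol quoted under "Open Datasets" (sampling SMPL pose parameters from UP-3D, 3DPW, and Human3.6M subjects S1/S5/S6/S7/S8) can be mirrored with a small data-assembly step. The sketch below is illustrative only: the .npz file names, the "poses" key, and the 72-D axis-angle layout are assumptions, not details given in the paper.

```python
import numpy as np

# Hypothetical archives holding SMPL pose parameters (72-D axis-angle vectors)
# exported from each training source; the file names and the "poses" key are
# illustrative, not taken from the paper.
SOURCES = [
    "up3d_train_poses.npz",
    "3dpw_train_poses.npz",
    "h36m_S1-S5-S6-S7-S8_poses.npz",
]

def build_pose_pool(paths=SOURCES):
    """Concatenate SMPL pose parameters from all training sources into one pool."""
    pools = [np.load(path)["poses"] for path in paths]  # each array: (N_i, 72)
    return np.concatenate(pools, axis=0)

def sample_poses(pool, batch_size=16, seed=0):
    """Draw a random batch of SMPL poses, e.g. to render synthetic training pairs."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(pool), size=batch_size)
    return pool[idx]  # (batch_size, 72)
```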
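
The evaluation splits quoted under "Dataset Splits" can be written down as a configuration dictionary. The subject and sequence names come from the text above; the dictionary layout and helper function are an illustrative choice, not a structure used by the authors.

```python
# Evaluation splits as described in the paper; only the dictionary layout
# (keys, nesting) is an assumption made for illustration.
EVAL_SPLITS = {
    "human36m": {
        "protocol": 1,
        "train_subjects": ["S1", "S5", "S6", "S7", "S8"],
        "test_subjects": ["S9", "S11"],
        "num_views": 4,
    },
    "total_capture": {
        "test_subjects": [1, 2, 3, 4, 5],
        "test_sequences": ["Walking2", "Freestyle3", "Acting3"],
        "num_cameras": 8,
        "frame_rate_hz": 60,
    },
}

def is_test_sample(dataset, subject, sequence=None):
    """Return True if a (subject, sequence) pair belongs to the test split."""
    split = EVAL_SPLITS[dataset]
    if subject not in split["test_subjects"]:
        return False
    return sequence is None or sequence in split.get("test_sequences", [sequence])
```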
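
Since the paper names Keypoint-RCNN and DensePose-RCNN but not their implementations or versions, a reproduction has to pick and record concrete dependencies. The sketch below assumes torchvision's pretrained Keypoint R-CNN (torchvision >= 0.13 for the `weights` argument); DensePose-RCNN would typically come from detectron2 and is not shown. None of these choices are confirmed by the paper.

```python
import torch
import torchvision
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Record the exact versions used, since the paper does not specify them.
print("torch", torch.__version__, "torchvision", torchvision.__version__)

# Pretrained 2D keypoint detector (assumed stand-in for the paper's Keypoint-RCNN).
detector = keypointrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

with torch.no_grad():
    image = torch.rand(3, 480, 640)      # float image in [0, 1], shape (3, H, W)
    outputs = detector([image])          # one dict per input image
    keypoints = outputs[0]["keypoints"]  # (num_people, 17, 3): x, y, visibility
    scores = outputs[0]["scores"]        # per-detection confidence
```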
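
The quoted hyper-parameters (Adam, 6 epochs, learning rate 1e-4, batch size 16, crop scale 1.2) translate directly into a training configuration. In the minimal sketch below, the tiny regressor, the synthetic heatmap batch, and the MSE loss are placeholders so the loop runs end to end; the paper's actual network, self-supervised losses, and data pipeline are not reproduced here.

```python
import torch
from torch import nn

# Hyper-parameters quoted from the paper.
EPOCHS = 6
LEARNING_RATE = 1e-4
BATCH_SIZE = 16
CROP_SCALE = 1.2  # masks and 2D joint heatmaps are cropped with this scale

# Placeholder regressor: maps 17-channel joint heatmaps to 85 SMPL-style
# parameters (72 pose + 10 shape + 3 camera); the real model is not public.
model = nn.Sequential(nn.Flatten(), nn.Linear(17 * 64 * 64, 85))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.MSELoss()  # placeholder for the paper's training losses

for epoch in range(EPOCHS):
    # Synthetic batch standing in for cropped heatmaps and regression targets.
    heatmaps = torch.rand(BATCH_SIZE, 17, 64, 64)
    targets = torch.rand(BATCH_SIZE, 85)
    preds = model(heatmaps)
    loss = criterion(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```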