Stochastic Multi-Person 3D Motion Forecasting

Authors: Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on CMU-Mocap, MuPoTS-3D, and SoMoF benchmarks show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.
Researcher Affiliation | Academia | Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui, University of Illinois at Urbana-Champaign, {siruixu2, yxw, lgui}@illinois.edu
Pseudocode | No | The paper does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is released or provide a link to it. It only mentions that “Part of our code is based on AMCParser (MIT license), attention-is-all-you-need-pytorch (MIT license), and MRT (Wang et al., 2021b) (not specified), and XIA (Guo et al., 2022) (GPL license).”
Open Datasets | Yes | In the main paper, we show the evaluation on two motion capture datasets, CMU-Mocap (CMU) and MuPoTS-3D (Mehta et al., 2018). CMU-Mocap consists of movement sequences with up to two subjects for each scene. It contains 2,235 recordings performed by 144 different subjects, eight of which include double-person motion. We directly adopt these two-person motions for comparisons in two-person scenarios. For skeletal representation, we follow Wang et al. (2021b) for the train/test split and the preprocess to mix single-person and double-person motion together to synthesize a 3-person motion in each scene. For meshes generated from SMPL-X representations (Pavlakos et al., 2019), we extract the multi-person data in CMU-Mocap from AMASS (Mahmood et al., 2019) and follow the same strategy to mix single-person and double-person motion together. MuPoTS-3D consists of more than 8,000 frames with up to three subjects. We convert the data to the same 15-joint human skeleton and length units as CMU-Mocap, and evaluate the generalization on MuPoTS-3D of a model trained only on CMU-Mocap. We also report our performance on the SoMoF benchmark (Adeli et al., 2020; 2021) in Sec. G of the Appendix.
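The mixing strategy quoted above (combining a double-person CMU-Mocap clip with a single-person clip to synthesize a 3-person scene in the 15-joint skeleton format) could be sketched roughly as follows; the function name, array shapes, and the random cropping/offset logic are illustrative assumptions, not the authors' released preprocessing code.

```python
import numpy as np

def synthesize_three_person_scene(two_person_clip, single_person_clip,
                                  num_frames=60, rng=None):
    """Mix a two-person clip with a single-person clip into one 3-person scene.

    Clips are assumed to have shape (frames, persons, 15 joints, 3), matching
    the 15-joint skeleton mentioned in the paper; the cropping and translation
    below are a guess at the mixing strategy, not the authors' exact pipeline.
    """
    rng = rng or np.random.default_rng()

    # Randomly crop both clips to a common length.
    start_a = rng.integers(0, two_person_clip.shape[0] - num_frames + 1)
    start_b = rng.integers(0, single_person_clip.shape[0] - num_frames + 1)
    pair = two_person_clip[start_a:start_a + num_frames]      # (T, 2, 15, 3)
    solo = single_person_clip[start_b:start_b + num_frames]   # (T, 1, 15, 3)

    # Translate the single-person motion so it does not overlap the pair.
    offset = rng.uniform(low=-2.0, high=2.0, size=3)
    solo = solo + offset

    # Concatenate along the person axis to obtain a (T, 3, 15, 3) scene.
    return np.concatenate([pair, solo], axis=1)
```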
Dataset Splits | No | The paper mentions a “train/test split” but does not explicitly provide details about a validation split or percentages for it.
Hardware Specification | Yes | On one NVIDIA GeForce GTX TITAN X GPU, training an epoch takes approximately 5 minutes. This work used NVIDIA GPUs at NCSA Delta through allocation CIS220014 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by NSF Grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Software Dependencies | No | The code is based on PyTorch. We use ADAM (Kingma & Ba, 2014) to train the model. The paper mentions software like PyTorch but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | The two hyperparameters (α, β) in the diversity-promoting loss (Sec. C) are set to (50, 100). The model is trained using a batch size of 32 for 50 epochs, with 6000 training examples per epoch. We use ADAM (Kingma & Ba, 2014) to train the model. For skeletal representation, following Wang et al. (2021b), we train the model to predict a 15-frame sequence of 3 people given the ground truth 15, 30, and 45 past frames at 15Hz, as the encoder (RNN/Transformer) accepts different input lengths. We use a 6-layer transformer (or RNNs), where we set the feature dimension to 128. For evaluation, we recursively predict the next 15 frames 3 times given all past frames generated, as illustrated in Sec. 3.3. Thus, given the number of intents to be M, the model outputs M, M², and M³ different predictions in sequence. For the SMPL representation, we train the model to predict a 25-frame sequence of 3 people given the 10 past frames at 30Hz. We use an 8-layer transformer, where we set the feature dimension to 512.
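The recursive evaluation described in this row (predicting the next 15 frames three stages in a row, branching over M intents so the model outputs M, M², and M³ futures) could look roughly like the sketch below; `model(seq, intent)` is a hypothetical interface standing in for the paper's actual intent-conditioned sampling mechanism.

```python
import torch

def recursive_forecast(model, past, num_intents, num_stages=3, horizon=15):
    """Recursively predict `horizon`-frame chunks, branching over intents.

    Starting from the observed frames, stage k yields num_intents**k distinct
    rollouts, i.e. M, M^2, and M^3 predictions in sequence for three stages.
    """
    sequences = [past]                           # each: (T_obs, 3 persons, J, 3)
    for _ in range(num_stages):
        next_sequences = []
        for seq in sequences:
            for intent in range(num_intents):
                pred = model(seq, intent)        # assumed to return (horizon, 3, J, 3)
                next_sequences.append(torch.cat([seq, pred], dim=0))
        sequences = next_sequences
    return sequences                             # num_intents**num_stages rollouts

# Training settings reported in the paper (skeletal setting), for reference:
# Adam optimizer, batch size 32, 50 epochs, 6000 examples per epoch,
# 6-layer transformer with feature dimension 128, (alpha, beta) = (50, 100).
# optimizer = torch.optim.Adam(model.parameters())
```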