Stochastic Multi-Person 3D Motion Forecasting
Authors: Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on CMU-Mocap, MuPoTS-3D, and SoMoF benchmarks show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art. |
| Researcher Affiliation | Academia | Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui, University of Illinois at Urbana-Champaign. {siruixu2, yxw, lgui}@illinois.edu |
| Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is released or provide a link to it. It only mentions that “Part of our code is based on AMCParser (MIT license), attention-is-all-you-need-pytorch (MIT license), and MRT (Wang et al., 2021b) (not specified), and XIA (Guo et al., 2022) (GPL license).” |
| Open Datasets | Yes | In the main paper, we show the evaluation on two motion capture datasets, CMU-Mocap (CMU) and MuPoTS-3D (Mehta et al., 2018). CMU-Mocap consists of movement sequences with up to two subjects for each scene. It contains 2,235 recordings performed by 144 different subjects, eight of which include double-person motion. We directly adopt these two-person motions for comparisons in two-person scenarios. For skeletal representation, we follow Wang et al. (2021b) for the train/test split and the preprocess to mix single-person and double-person motion together to synthesize a 3-person motion in each scene. For meshes generated from SMPL-X representations (Pavlakos et al., 2019), we extract the multi-person data in CMU-Mocap from AMASS (Mahmood et al., 2019) and follow the same strategy to mix single-person and double-person motion together. MuPoTS-3D consists of more than 8,000 frames with up to three subjects. We convert the data to the same 15-joint human skeleton and length units as CMU-Mocap, and evaluate the generalization on MuPoTS-3D of a model trained only on CMU-Mocap. We also report our performance on the SoMoF benchmark (Adeli et al., 2020; 2021) in Sec. G of the Appendix. An illustrative sketch of this multi-person mixing step is given after the table. |
| Dataset Splits | No | The paper mentions a “train/test split” but does not explicitly describe a validation split or give split percentages. |
| Hardware Specification | Yes | On one NVIDIA GeForce GTX TITAN X GPU, training an epoch takes approximately 5 minutes. This work used NVIDIA GPUs at NCSA Delta through allocation CIS220014 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by NSF Grants #2138259, #2138286, #2138307, #2137603, and #2138296. |
| Software Dependencies | No | The code is based on PyTorch. We use ADAM (Kingma & Ba, 2014) to train the model. The paper mentions software like PyTorch but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | The two hyperparameters (α, β) in the diversity promoting loss (Sec. C) are set to (50, 100). The model is trained using a batch size of 32 for 50 epochs, with 6000 training examples per epoch. We use ADAM (Kingma & Ba, 2014) to train the model. For skeletal representation, following Wang et al. (2021b), we train the model to predict a 15-frame sequence of 3 people given the ground truth 15, 30, and 45 past frames at 15Hz, as the encoder (RNN/Transformer) accepts different input lengths. We use a 6-layer transformer (or RNNs), where we set the feature dimension to 128. For evaluation, we recursively predict the next 15 frames 3 times given all past frames generated, as illustrated in Sec. 3.3. Thus, given the number of intents to be M, the model outputs M, M², and M³ different predictions in sequence. For the SMPL representation, we train the model to predict a 25-frame sequence of 3 people given the 10 past frames at 30Hz. We use an 8-layer transformer, where we set the feature dimension to 512. An illustrative sketch of this recursive sampling procedure follows the table. |
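
The Open Datasets row states that single-person and double-person CMU-Mocap clips are mixed to synthesize 3-person scenes on a 15-joint skeleton, following Wang et al. (2021b). The snippet below is a minimal sketch of what such a mixing step could look like; the array shapes, the `mix_three_person_scene` function, and the random ground-plane offset range are assumptions for illustration, not the authors' released preprocessing code.

```python
import numpy as np

N_JOINTS = 15   # 15-joint skeleton used for both CMU-Mocap and MuPoTS-3D
SEQ_LEN = 60    # past + future frames at 15 Hz (e.g., 45 + 15); assumed value

def mix_three_person_scene(two_person_clip, single_person_clip, rng):
    """Combine a (2, T, J, 3) clip and a (1, T, J, 3) clip into a (3, T, J, 3) scene."""
    assert two_person_clip.shape[1:] == (SEQ_LEN, N_JOINTS, 3)
    assert single_person_clip.shape[1:] == (SEQ_LEN, N_JOINTS, 3)

    # Randomly translate the single-person motion on the ground plane so the
    # added subject does not sit on top of the existing pair (range is a guess).
    offset = np.zeros(3)
    offset[[0, 2]] = rng.uniform(-2.0, 2.0, size=2)
    extra_person = single_person_clip + offset

    # Stack along the person axis to obtain a synthetic 3-person scene.
    return np.concatenate([two_person_clip, extra_person], axis=0)

rng = np.random.default_rng(0)
scene = mix_three_person_scene(
    rng.standard_normal((2, SEQ_LEN, N_JOINTS, 3)),
    rng.standard_normal((1, SEQ_LEN, N_JOINTS, 3)),
    rng,
)
print(scene.shape)  # (3, 60, 15, 3)
```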
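
The Experiment Setup row describes a recursive evaluation protocol: the model predicts 15 future frames, each prediction is fed back as context, and with M intents this yields M, M², and M³ distinct samples after 1, 2, and 3 stages. The sketch below illustrates that branching rollout under an assumed `model(context, intent=...)` interface; it is not the authors' actual inference code.

```python
import torch

FUTURE_LEN = 15  # frames predicted per stage at 15 Hz

def recursive_forecast(model, past, num_intents, num_stages=3):
    """past: (P, T_past, D) tensor for P people; returns all branching rollouts."""
    rollouts = [past]
    for _ in range(num_stages):
        next_rollouts = []
        for context in rollouts:
            for intent in range(num_intents):
                # One stochastic sample per intent, conditioned on everything
                # generated so far (assumed model signature, shape (P, FUTURE_LEN, D)).
                future = model(context, intent=intent)
                next_rollouts.append(torch.cat([context, future], dim=1))
        rollouts = next_rollouts  # size grows by a factor of num_intents per stage
    return rollouts

# With M = 3 intents, the stages produce 3, 9, and 27 rollouts in sequence,
# so len(recursive_forecast(model, past, num_intents=3)) == 27.
```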