Motion Question Answering via Modular Motion Programs

Authors: Mark Endo, Joy Hsu, Jiaman Li, Jiajun Wu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We generate a dataset of question-answer pairs that require detecting motor cues in small portions of motion sequences, reasoning temporally about when events occur, and querying specific motion attributes. In addition, we propose NSPose, a neurosymbolic method for this task that uses symbolic reasoning and a modular design to ground motion through learning motion concepts, attribute neural operators, and temporal relations. We demonstrate the suitability of NSPose for the Human Motion QA task, outperforming all baseline methods.
Researcher Affiliation | Collaboration | Department of Computer Science, Stanford University. Correspondence to: Mark Endo <markendo@stanford.edu>. Acknowledgments: We thank Sumith Kulal for providing valuable feedback on the paper. This work is in part supported by Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford Wu Tsai Human Performance Alliance, Toyota Research Institute (TRI), NSF RI #2211258, ONR MURI N00014-22-1-2740, AFOSR YIP FA9550-23-1-0127, Analog Devices, JPMorgan Chase, Meta, and Salesforce.
Pseudocode | Yes | A.1. Domain-specific language & program implementations: We define the domain-specific language (DSL) used for the Human Motion QA task. Table 3 includes signatures and semantics for all functions, and Table 4 includes implementations for all functions.
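To make the modular-program idea concrete, below is a minimal Python sketch of how a question could be executed as a composition of DSL-style functions over motion segments. The function names (filter_action, relate, query_attribute), the 0.5 threshold, and the segment dictionary fields are illustrative assumptions, not the paper's exact DSL; the authoritative signatures and semantics are in Tables 3 and 4 of the appendix.

```python
# Sketch of executing a motion QA question as a modular symbolic program.
# Names and thresholds are assumptions for illustration only.

def filter_action(segments, concept, concept_score):
    """Keep segments whose learned score for `concept` exceeds a threshold."""
    return [s for s in segments if concept_score(s, concept) > 0.5]

def relate(segments, anchor_segments, relation):
    """Keep segments standing in a temporal relation to the anchor segments."""
    if relation == "after":
        anchor_end = max(s["end_frame"] for s in anchor_segments)
        return [s for s in segments if s["start_frame"] >= anchor_end]
    if relation == "before":
        anchor_start = min(s["start_frame"] for s in anchor_segments)
        return [s for s in segments if s["end_frame"] <= anchor_start]
    raise ValueError(f"unknown relation: {relation}")

def query_attribute(segments, attribute, attribute_scores):
    """Return the highest-scoring value of an attribute (e.g., direction)."""
    return max(attribute_scores(segments, attribute), key=lambda kv: kv[1])[0]

# A question such as "Which direction does the person move after jumping?"
# could then compile to roughly:
#   anchors = filter_action(all_segments, "jump", concept_score)
#   later   = relate(all_segments, anchors, "after")
#   answer  = query_attribute(later, "direction", attribute_scores)
```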
Open Source Code | Yes | The code for generating this dataset is available at https://github.com/markendo/HumanMotionQA/.
Open Datasets | Yes | To build BABEL-QA, we create question-answer pairs from motion sequences and annotations in the BABEL dataset (Punnakkal et al., 2021). We leverage BABEL, as it contains dense labels that describe each individual action in the temporal composition, in addition to when the action occurs in the motion sequence.
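The following is a small, hypothetical sketch of how a temporal question-answer pair could be derived from BABEL-style dense labels (action name plus frame span). The annotation fields and the question template are assumptions for illustration; the released BABEL-QA generation code in the linked repository is the reference implementation.

```python
# Hypothetical example of turning dense, temporally localized action labels
# into a question-answer pair. Field names and templates are assumptions.

annotation = {
    "sequence_id": "seq_0001",
    "labels": [
        {"action": "walk",     "start_frame": 0,   "end_frame": 120},
        {"action": "turn",     "start_frame": 120, "end_frame": 160},
        {"action": "sit down", "start_frame": 160, "end_frame": 220},
    ],
}

def make_temporal_question(labels, anchor_idx):
    """Build an 'action after X' question from consecutive dense labels."""
    anchor = labels[anchor_idx]["action"]
    answer = labels[anchor_idx + 1]["action"]
    question = f"What action does the person perform after they {anchor}?"
    return {"question": question, "answer": answer}

print(make_temporal_question(annotation["labels"], anchor_idx=0))
# {'question': 'What action does the person perform after they walk?', 'answer': 'turn'}
```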
Dataset Splits | Yes | With this processing, our final dataset is composed of 771 train motion sequences, 167 validation motion sequences, and 171 test motion sequences with an associated 1800 train questions, 384 validation questions, and 393 test questions.
Hardware Specification | No | The paper does not specify the hardware used (e.g., GPU model, CPU type) for running the experiments.
Software Dependencies | No | The paper mentions models and components like "Two-Stream Adaptive Graph Convolutional Network (2s-AGCN)" and "1D convolutional layers", but it does not specify software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | We split each input motion sequence into segments of f frames, with a varying number of segments in each sequence. We also overlap segments by o frames on each side in order to provide the model with more context in each segment. In our experiments, we set f = 45 and o = 15. ... The CNN has three intermediate convolution layers with 16 filters per layer, a kernel size of three, and exponential dilation in every layer.
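A minimal sketch of this setup is shown below. The use of PyTorch, the pose feature dimensionality, and the padding scheme are assumptions; only the segment length f = 45, overlap o = 15, three layers, 16 filters, kernel size of three, and exponentially growing dilation come from the text above.

```python
# Sketch of overlapping segmentation and a dilated 1D CNN segment encoder.
# Framework choice (PyTorch), feature dim, and padding are assumptions.
import torch
import torch.nn as nn

def split_into_segments(motion, f=45, o=15):
    """Split a (num_frames, feat_dim) motion into windows of length f,
    extended by o context frames on each side, stepping by f."""
    segments = []
    for start in range(0, motion.shape[0], f):
        lo = max(start - o, 0)
        hi = min(start + f + o, motion.shape[0])
        segments.append(motion[lo:hi])
    return segments

class DilatedSegmentCNN(nn.Module):
    """Three 1D conv layers, 16 filters each, kernel size 3, dilation 1/2/4."""
    def __init__(self, in_dim):
        super().__init__()
        layers, channels = [], in_dim
        for i in range(3):
            dilation = 2 ** i  # exponential dilation per layer
            layers += [nn.Conv1d(channels, 16, kernel_size=3,
                                 dilation=dilation, padding=dilation),
                       nn.ReLU()]
            channels = 16
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        x = x.transpose(1, 2)        # -> (batch, feat_dim, frames)
        return self.net(x).transpose(1, 2)

motion = torch.randn(300, 72)        # e.g., 300 frames of pose features
segments = split_into_segments(motion)
encoder = DilatedSegmentCNN(in_dim=72)
features = encoder(segments[0].unsqueeze(0))  # per-segment features
```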