SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction

Authors: Wei Wu, Xiaoxin Feng, Ziyan Gao, Yuheng Kan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SMART achieves state-of-the-art performance across most metrics on the generative Sim Agents challenge, ranking 1st on the Waymo Open Motion Dataset (WOMD) leaderboard while demonstrating remarkable inference speed. Moreover, SMART exhibits zero-shot generalization in the autonomous driving motion domain: using only the NuPlan dataset for training and WOMD for validation, SMART achieved a competitive score of 0.72 on the Sim Agents challenge. Lastly, we have collected over 1 billion motion tokens from multiple datasets, validating the model's scalability.
Researcher Affiliation | Collaboration | Wei Wu (Tsinghua University; SenseTime Research), wuwei@senseauto.com; Xiaoxin Feng (SenseTime Research), fengxiaoxin@senseauto.com; Ziyan Gao (SenseTime Research), gaoziyan@senseauto.com; Yuheng Kan (SenseTime Research), kanyuheng@senseauto.com
Pseudocode | No | The paper describes the model architecture and training tasks but does not include structured pseudocode or algorithm blocks. [An illustrative sketch follows the table.]
Open Source Code | Yes | We have released all the code to promote the exploration of models for motion generation in the autonomous driving field. The source code is available at https://github.com/rainmaker22/SMART.
Open Datasets | Yes | ...ranking 1st on the leaderboard of the Waymo Open Motion Dataset (WOMD), demonstrating remarkable inference speed... Using only the NuPlan dataset for training and WOMD for validation, SMART achieved a competitive score of 0.72 on the Sim Agents challenge.
Dataset Splits | Yes | For all experiments, testing used the validation split of WOMD. Overall, we trained models of four sizes, ranging from 1M to 100M parameters, on a training set containing 2.2M scenarios (or 1B motion tokens under 0.5 s agent motion tokenization). [A back-of-envelope consistency check follows the table.]
Hardware Specification | Yes | Training and inference times were measured on 32 NVIDIA Tesla V100 GPUs; all models in the paper were trained on this setup. Training requires at least 25GB of GPU memory, while inference typically requires only 10GB.
Software Dependencies | No | The paper mentions using the AdamW optimizer [23] but does not specify versions for other key software components or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | Both the dropout rate and the weight decay rate are set to 0.1. The learning rate is decayed from 0.0002 to 0 using a cosine annealing scheduler. Training includes all vehicles within a scene. The batch size is set to 4, with a maximum GPU memory usage of 30GB. [A configuration sketch follows the table.]
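
Since the paper provides no pseudocode, the following is a minimal, hypothetical PyTorch sketch of a next-token-prediction training step over discretized motion tokens. The TinyMotionLM module, vocabulary size, and tensor shapes are illustrative assumptions rather than the authors' implementation; SMART's actual decoder additionally conditions on map and agent context.

    import torch
    import torch.nn.functional as F

    VOCAB = 1024  # assumed size of the discrete motion-token vocabulary

    class TinyMotionLM(torch.nn.Module):
        # Stand-in decoder; SMART's real model also attends to map/agent features.
        def __init__(self):
            super().__init__()
            self.embed = torch.nn.Embedding(VOCAB, 64)
            self.head = torch.nn.Linear(64, VOCAB)

        def forward(self, tokens):
            return self.head(self.embed(tokens))

    model = TinyMotionLM()
    tokens = torch.randint(0, VOCAB, (4, 18))    # (batch, motion-token sequence)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                       # (4, 17, VOCAB)
    # Standard autoregressive cross-entropy on the shifted sequence.
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    loss.backward()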
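As a back-of-envelope check on the reported scale in the Dataset Splits row: the 9.1 s WOMD scenario length is an assumption from that dataset's specification, and the implied per-scene agent count is an inference, not a figure reported in the paper.

    scenarios = 2.2e6                 # training scenarios reported above
    total_tokens = 1e9                # motion tokens reported above
    tokens_per_scenario = total_tokens / scenarios           # ~455

    tokens_per_agent = 9.1 / 0.5                             # ~18, assuming 9.1 s scenarios
    implied_agents = tokens_per_scenario / tokens_per_agent  # ~25 agents per scene
    print(round(tokens_per_scenario), round(implied_agents))  # 455 25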
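Finally, a minimal sketch of the optimization setup in the Experiment Setup row, assuming PyTorch (the paper cites the AdamW optimizer but, as noted above, does not pin down the framework). Here model is a stand-in module and total_steps is a placeholder not given in the excerpt.

    import torch

    model = torch.nn.Linear(8, 8)   # stand-in; the reported dropout of 0.1 lives inside the real model
    total_steps = 100_000           # placeholder: schedule length is not reported

    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=2e-4,           # reported initial learning rate
                                  weight_decay=0.1)  # reported weight decay
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps, eta_min=0.0)   # cosine decay from 2e-4 to 0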