Motion Transformer with Global Intention Localization and Local Movement Refinement

Authors: Shaoshuai Shi, Li Jiang, Dengxin Dai, Bernt Schiele

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that MTR achieves state-of-the-art performance on both the marginal and joint motion prediction challenges, ranking 1st on the leaderboards of Waymo Open Motion Dataset. Code will be available at https://github.com/sshaoshuai/MTR."
Researcher Affiliation | Academia | "Shaoshuai Shi, Li Jiang, Dengxin Dai, Bernt Schiele; Max Planck Institute for Informatics, Saarland Informatics Campus; {sshi, lijiang, ddai, schiele}@mpi-inf.mpg.de"
Pseudocode | No | The paper includes architectural diagrams and mathematical formulations, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | "Code will be available at https://github.com/sshaoshuai/MTR."
Open Datasets | Yes | "We evaluate our approach on the large-scale Waymo Open Motion Dataset (WOMD) [14]"
Dataset Splits | Yes | "There are totally 487k training scenes, and about 44k validation scenes and 44k testing scenes for each challenge."
Hardware Specification | Yes | "We train the model for 30 epochs with 8 GPUs (NVIDIA RTX 8000)"
Software Dependencies | No | The paper mentions an AdamW optimizer but does not name the software libraries used (e.g., PyTorch, TensorFlow), their version numbers, or any other key software dependencies with specific versions.
Experiment Setup | Yes | "For the context encoding, we stack 6 transformer encoder layers. The road map is represented as multiple polylines, where each polyline contains up to 20 points (about 10m in WOMD). We select Nm = 768 nearest map polylines around the interested agent. The number of neighbors in the encoder's local self-attention is set to 16. The encoder hidden feature dimension is set as D = 256. For the decoder modules, we stack 6 decoder layers. L is set to 128 to collect the closest map polylines from the context encoder for motion refinement. By default, we utilize 64 motion query pairs, where their intention points are generated by conducting the k-means clustering algorithm on the training set. To generate 6 future trajectories for evaluation, we use non-maximum suppression (NMS) to select the top 6 predictions from the 64 predicted trajectories by calculating the distances between their endpoints, with the distance threshold set as 2.5m. Our model is trained end-to-end by the AdamW optimizer with a learning rate of 0.0001 and a batch size of 80 scenes. We train the model for 30 epochs with 8 GPUs (NVIDIA RTX 8000), and the learning rate is decayed by a factor of 0.5 every 2 epochs from epoch 20. The weight decay is set as 0.01 and we do not use any data augmentation."
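
The 64 motion query pairs are anchored at intention points obtained by running k-means on the endpoints of ground-truth trajectories from the training set. Below is a minimal sketch of that clustering step, assuming the endpoints are available as an (N, 2) NumPy array; the function and variable names are illustrative, not taken from the released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_intention_points(gt_endpoints: np.ndarray, num_queries: int = 64) -> np.ndarray:
    """Cluster ground-truth trajectory endpoints into intention points.

    gt_endpoints: (N, 2) array of final (x, y) positions of training trajectories.
    Returns a (num_queries, 2) array; each cluster center serves as one
    intention point anchoring a motion query pair.
    """
    kmeans = KMeans(n_clusters=num_queries, n_init=10, random_state=0)
    kmeans.fit(gt_endpoints)
    return kmeans.cluster_centers_
```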
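At evaluation time, the 64 predicted trajectories are reduced to 6 via NMS over endpoint distances with a 2.5 m threshold. The paper states the criterion but not the exact procedure; the sketch below assumes the standard greedy form, where the highest-scoring trajectory is kept and any remaining prediction whose endpoint lies within the threshold of a kept one is suppressed. Array shapes and names are assumptions.

```python
import numpy as np

def nms_select_trajectories(trajs: np.ndarray, scores: np.ndarray,
                            num_keep: int = 6, dist_thresh: float = 2.5) -> np.ndarray:
    """trajs: (64, T, 2) predicted trajectories; scores: (64,) confidence scores.

    Greedy NMS over trajectory endpoints: keep the best-scoring trajectory,
    drop any candidate whose endpoint is within dist_thresh metres of an
    already-kept endpoint, and stop once num_keep trajectories survive.
    """
    endpoints = trajs[:, -1, :]            # (64, 2) final positions
    keep = []
    for idx in np.argsort(-scores):        # indices by descending score
        if len(keep) == num_keep:
            break
        if all(np.linalg.norm(endpoints[idx] - endpoints[k]) > dist_thresh
               for k in keep):
            keep.append(idx)
    return trajs[keep]
```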
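The optimization recipe (AdamW, learning rate 0.0001, weight decay 0.01, batch size 80 scenes, 30 epochs, learning rate halved every 2 epochs from epoch 20) maps directly onto a standard optimizer plus a multi-step schedule. The sketch below assumes a PyTorch implementation, which the paper does not confirm, and uses a stand-in module in place of the MTR network.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 256)  # stand-in for the MTR network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# Halve the learning rate at epochs 20, 22, 24, 26, and 28,
# i.e. "decayed by a factor of 0.5 every 2 epochs from epoch 20".
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 22, 24, 26, 28], gamma=0.5)

for epoch in range(30):
    optimizer.step()   # stand-in for the per-batch gradient updates
    scheduler.step()   # advances the LR schedule once per epoch
```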