Motion Transformer with Global Intention Localization and Local Movement Refinement
Authors: Shaoshuai Shi, Li Jiang, Dengxin Dai, Bernt Schiele
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that MTR achieves state-of-the-art performance on both the marginal and joint motion prediction challenges, ranking 1st on the leaderboards of Waymo Open Motion Dataset. Code will be available at https://github.com/sshaoshuai/MTR. |
| Researcher Affiliation | Academia | Shaoshuai Shi, Li Jiang, Dengxin Dai, Bernt Schiele Max Planck Institute for Informatics, Saarland Informatics Campus {sshi, lijiang, ddai, schiele}@mpi-inf.mpg.de |
| Pseudocode | No | The paper includes architectural diagrams and mathematical formulations, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code will be available at https://github.com/sshaoshuai/MTR. |
| Open Datasets | Yes | We evaluate our approach on the large-scale Waymo Open Motion Dataset (WOMD) [14] |
| Dataset Splits | Yes | There are totally 487k training scenes, and about 44k validation scenes and 44k testing scenes for each challenge. |
| Hardware Specification | Yes | We train the model for 30 epochs with 8 GPUs (NVIDIA RTX 8000) |
| Software Dependencies | No | The paper mentions using an 'AdamW optimizer' but does not name any software libraries (e.g., PyTorch, TensorFlow) or version numbers for its dependencies. |
| Experiment Setup | Yes | For the context encoding, we stack 6 transformer encoder layers. The road map is represented as multiple polylines, where each polyline contains up to 20 points (about 10m in WOMD). We select Nm = 768 nearest map polylines around the interested agent. The number of neighbors in the encoder's local self-attention is set to 16. The encoder hidden feature dimension is set as D = 256. For the decoder modules, we stack 6 decoder layers. L is set to 128 to collect the closest map polylines from the context encoder for motion refinement. By default, we utilize 64 motion query pairs where their intention points are generated by conducting the k-means clustering algorithm on the training set. To generate 6 future trajectories for evaluation, we use non-maximum suppression (NMS) to select the top 6 predictions from 64 predicted trajectories by calculating the distances between their endpoints, and the distance threshold is set as 2.5m. Our model is trained end-to-end by the AdamW optimizer with a learning rate of 0.0001 and batch size of 80 scenes. We train the model for 30 epochs with 8 GPUs (NVIDIA RTX 8000), and the learning rate is decayed by a factor of 0.5 every 2 epochs from epoch 20. The weight decay is set as 0.01 and we do not use any data augmentation. |
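The NMS trajectory-selection step quoted above (select 6 of 64 predictions by endpoint distance, threshold 2.5m) can be sketched as follows. This is a minimal greedy sketch, not the authors' released code; the function name, array shapes, and use of confidence-ordered suppression are assumptions.

```python
import numpy as np

def nms_select_trajectories(trajectories, scores, k=6, dist_thresh=2.5):
    """Greedy endpoint-distance NMS over predicted trajectories (a sketch).

    trajectories: (N, T, 2) array of predicted (x, y) waypoints.
    scores: (N,) predicted confidence per trajectory.
    Returns indices of up to k selected trajectories.
    """
    order = np.argsort(-scores)          # visit highest-confidence first
    endpoints = trajectories[:, -1, :]   # (N, 2) final waypoint of each trajectory
    selected = []
    for idx in order:
        if len(selected) == k:
            break
        # Keep a candidate only if its endpoint is farther than dist_thresh
        # (2.5m in the paper) from every already-selected endpoint.
        if all(np.linalg.norm(endpoints[idx] - endpoints[s]) > dist_thresh
               for s in selected):
            selected.append(int(idx))
    return selected
```

In the paper's setting the input would be the 64 predicted trajectories and the output the 6 kept for evaluation; the sketch parameterizes both so the suppression logic is easy to check on toy data.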