Query-based Temporal Fusion with Explicit Motion for 3D Object Detection
Authors: Jinghua Hou, Zhe Liu, Dingkang Liang, Zhikang Zou, Xiaoqing Ye, Xiang Bai
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show our proposed QTNet outperforms BEV-based and proposal-based approaches on the nuScenes dataset. Besides, the MTM is a plug-and-play module, which can be integrated into advanced LiDAR-only or multi-modality 3D detectors and brings new SOTA performance with negligible computation cost and latency on the nuScenes dataset. These experiments illustrate the superiority and generalization of our method. |
| Researcher Affiliation | Collaboration | Jinghua Hou¹, Zhe Liu¹, Dingkang Liang¹, Zhikang Zou², Xiaoqing Ye², Xiang Bai¹. ¹Huazhong University of Science & Technology, ²Baidu Inc. |
| Pseudocode | No | The paper describes the method and provides mathematical equations, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/AlmoonYsl/QTNet. |
| Open Datasets | Yes | Dataset: We evaluate our method on the nuScenes dataset [3], which is a large-scale autonomous driving dataset. It contains 700, 150, and 150 scenes for training, validation, and testing. Each scene is roughly 20 seconds long, annotated at 2 Hz, and provides point clouds acquired by a 32-beam LiDAR and surrounding images acquired by 6 cameras. (A minimal loading sketch using the nuscenes-devkit is given below the table.) |
| Dataset Splits | Yes | It contains 700, 150, and 150 scenes for training, validation, and testing. |
| Hardware Specification | Yes | Then, we train our QTNet for 10 epochs without GT-Sampling [45] and CBGS [58] on four NVIDIA RTX 4090 GPUs. To optimize our network, we adopt the AdamW [29] optimizer with a one-cycle learning rate policy, and the batch size is set to 16. (From Section 4.2) and The FLOPs and latency are tested on a single NVIDIA RTX 4090 GPU with a batch size of 1. (From Table 3 caption) (A minimal optimizer sketch is given below the table.) |
| Software Dependencies | No | The paper mentions optimizers (AdamW) and uses existing detector frameworks (TransFusion-L, TransFusion, DeepInteraction) but does not specify programming languages, specific library versions (e.g., PyTorch, TensorFlow), or other ancillary software dependencies with version numbers required for reproduction. |
| Experiment Setup | Yes | The voxelization range is set to [-54m, 54m] for both the X and Y axes and [-5m, 3m] for the Z axis. The voxel size is set to (0.075m, 0.075m, 0.1m). We set the number of queries to 200 for training and testing. For temporal fusion, we utilize 2 or 3 historical frames. (From Section 4.2) and Our training process is divided into two stages. In the first stage, we train the DETR-like 3D detectors (TransFusion-L, TransFusion, and DeepInteraction) with their default settings. In the second stage, we generate the memory bank of query features and prediction results. Then, we train our QTNet for 10 epochs without GT-Sampling [45] and CBGS [58] on four NVIDIA RTX 4090 GPUs. To optimize our network, we adopt the AdamW [29] optimizer with a one-cycle learning rate policy, and the batch size is set to 16. (From Section 4.2) (An illustrative config sketch is given below the table.) |
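
The 700/150/150 split quoted in the Open Datasets and Dataset Splits rows is the official nuScenes protocol. As a minimal sketch (not from the paper), the splits can be enumerated with the official nuscenes-devkit; the `dataroot` path below is a placeholder:

```python
# Minimal sketch: enumerating the official nuScenes splits with the
# nuscenes-devkit (pip install nuscenes-devkit). The dataroot is a placeholder.
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.splits import create_splits_scenes

nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=False)
splits = create_splits_scenes()
print(len(splits['train']), len(splits['val']), len(splits['test']))  # 700 150 150
```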
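The voxelization and query settings from the Experiment Setup row can be collected into a single config dict. This is an illustrative sketch in the style of mmdetection3d/OpenPCDet-like configs: the key names are assumptions, and only the numeric values come from the paper.

```python
# Illustrative config sketch of the reported QTNet settings. Key names are
# assumptions; the numeric values are those quoted in the table above.
qtnet_cfg = dict(
    point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],  # [x_min, y_min, z_min, x_max, y_max, z_max], meters
    voxel_size=[0.075, 0.075, 0.1],  # (x, y, z) voxel size in meters
    num_queries=200,                 # queries used for both training and testing
    num_history_frames=3,            # the paper uses 2 or 3 historical frames
    epochs=10,
    batch_size=16,
)
```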
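The optimization recipe reported in the Hardware Specification row (AdamW with a one-cycle learning-rate policy, batch size 16, 10 epochs) maps onto standard PyTorch components. A minimal sketch follows, with a stand-in model; the learning rates, weight decay, and iteration count are assumptions, and the released code should be consulted for the exact values.

```python
# Minimal sketch of AdamW + one-cycle LR as reported (10 epochs, batch size 16).
# lr, max_lr, weight_decay, and iters_per_epoch are assumptions, not paper values.
import torch

model = torch.nn.Linear(256, 10)    # stand-in for the QTNet fusion module
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

epochs, iters_per_epoch = 10, 1000  # iters_per_epoch depends on dataset size
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=epochs * iters_per_epoch)

for _ in range(epochs * iters_per_epoch):
    optimizer.zero_grad()
    loss = model(torch.randn(16, 256)).sum()  # placeholder forward pass and loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```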