3D Siamese Voxel-to-BEV Tracker for Sparse Point Clouds
Authors: Le Hui, Lingpeng Wang, Mingmei Cheng, Jin Xie, Jian Yang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation on the KITTI and nuScenes datasets shows that our method significantly outperforms the current state-of-the-art methods by a large margin. |
| Researcher Affiliation | Academia | Le Hui, Lingpeng Wang, Mingmei Cheng, Jin Xie, Jian Yang PCA Lab, Nanjing University of Science and Technology, China {le.hui, cslpwang, chengmm, csjxie, csjyang}@njust.edu.cn |
| Pseudocode | No | The paper describes the architecture and components of the proposed method in detail, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/fpthink/V2B. |
| Open Datasets | Yes | For 3D single object tracking, we use KITTI [20] and nuScenes [6] datasets for training and evaluation. |
| Dataset Splits | Yes | For the KITTI dataset, we follow [21, 52] and use the training set to train and evaluate our method. It contains 21 video sequences and 8 types of objects. We use scenes 0-16 for training, scenes 17-18 for validation, and scenes 19-20 for testing. |
| Hardware Specification | No | The paper mentions training models and experimental evaluation but does not specify any hardware details like GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using PointNet++ as a backbone and the Adam optimizer, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Following [52], we set the number of points N = 512 and M = 1024 for the template and search area by randomly discarding and duplicating points. For the backbone network, we use a slightly modified PointNet++ [51], which consists of three set-abstraction (SA) layers (with query radii of 0.3, 0.5, and 0.7) and three feature propagation (FP) layers. After each SA layer, the points are randomly downsampled by half. For the shape generation network, we generate 2048 points. The global branch is max pooling combined with two fully connected layers, while the local branch only uses one EdgeConv layer. We use a two-layer MLP network to generate 3D coordinates. For 3D center detection, the voxel size is set to 0.3 meters in volumetric space. We stack four 3D convolutions (with strides of 2, 1, 2, 1 along the z-axis) and four 2D convolutions (with strides of 2, 1, 1, 2) combined with skip connections for feature aggregation, respectively. For all experiments, we use the Adam [31] optimizer with learning rate 0.001 for training, and the learning rate is decayed by a factor of 0.2 every 6 epochs. It takes about 20 epochs to train our model to convergence. |
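For reference, a minimal sketch of the KITTI scene partition quoted in the Dataset Splits row above; the dictionary and helper name are illustrative, not taken from the authors' repository.

```python
# Hypothetical helper reflecting the quoted KITTI split:
# scenes 0-16 for training, 17-18 for validation, 19-20 for testing.
KITTI_SPLITS = {
    "train": list(range(0, 17)),  # scenes 0-16
    "val":   [17, 18],            # scenes 17-18
    "test":  [19, 20],            # scenes 19-20
}

def scenes_for(split: str) -> list[int]:
    """Return the KITTI scene ids belonging to the requested split."""
    return KITTI_SPLITS[split]
```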
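The Experiment Setup row can also be summarized in a short training sketch. The snippet below is an assumption-laden PyTorch illustration: `model`, `train_loader`, and `compute_loss` are hypothetical placeholders rather than names from the V2B code, while the point-resampling rule (N = 512 template points, M = 1024 search-area points via random discarding or duplication) and the Adam schedule (lr 0.001, decayed by a factor of 0.2 every 6 epochs, about 20 epochs in total) follow the quoted setup.

```python
import numpy as np
import torch

def resample_points(points: np.ndarray, n: int) -> np.ndarray:
    """Randomly discard or duplicate points so exactly n remain
    (n = 512 for the template, n = 1024 for the search area)."""
    replace = len(points) < n  # duplicate only when there are too few points
    idx = np.random.choice(len(points), n, replace=replace)
    return points[idx]

def train(model, train_loader, compute_loss, num_epochs=20, device="cuda"):
    """Training skeleton with the quoted optimizer settings; the model,
    data loader, and loss function are placeholders, not the authors' code."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Multiply the learning rate by 0.2 every 6 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.2)

    for epoch in range(num_epochs):
        model.train()
        for template, search_area, targets in train_loader:
            optimizer.zero_grad()
            preds = model(template.to(device), search_area.to(device))
            loss = compute_loss(preds, targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```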