3D Siamese Voxel-to-BEV Tracker for Sparse Point Clouds

Authors: Le Hui, Lingpeng Wang, Mingmei Cheng, Jin Xie, Jian Yang

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluation on the KITTI and nuScenes datasets shows that our method significantly outperforms the current state-of-the-art methods by a large margin.
Researcher Affiliation | Academia | Le Hui, Lingpeng Wang, Mingmei Cheng, Jin Xie, Jian Yang; PCA Lab, Nanjing University of Science and Technology, China; {le.hui, cslpwang, chengmm, csjxie, csjyang}@njust.edu.cn
Pseudocode | No | The paper describes the architecture and components of the proposed method in detail, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/fpthink/V2B.
Open Datasets | Yes | For 3D single object tracking, we use the KITTI [20] and nuScenes [6] datasets for training and evaluation.
Dataset Splits | Yes | For the KITTI dataset, we follow [21, 52] and use the training set to train and evaluate our method. It contains 21 video sequences and 8 types of objects. We use scenes 0-16 for training, scenes 17-18 for validation, and scenes 19-20 for testing. (A sketch of this split follows the table.)
Hardware Specification | No | The paper mentions training models and experimental evaluation but does not specify any hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using PointNet++ as a backbone and the Adam optimizer, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Following [52], we set the number of points to N = 512 for the template and M = 1024 for the search area by randomly discarding and duplicating points. For the backbone network, we use a slightly modified PointNet++ [51], which consists of three set-abstraction (SA) layers (with query radii of 0.3, 0.5, and 0.7) and three feature propagation (FP) layers. After each SA layer, the points are randomly downsampled by half. For the shape generation network, we generate 2048 points. The global branch is max pooling combined with two fully connected layers, while the local branch uses only one EdgeConv layer. We use a two-layer MLP to generate 3D coordinates. For 3D center detection, the voxel size is set to 0.3 meters in volumetric space. We stack four 3D convolutions (with strides of 2, 1, 2, 1 along the z-axis) and four 2D convolutions (with strides of 2, 1, 1, 2), combined with skip connections, for feature aggregation. For all experiments, we use the Adam [31] optimizer with a learning rate of 0.001, and the learning rate decays by a factor of 0.2 every 6 epochs. Training takes about 20 epochs to converge. (Sketches of the feature-aggregation stack and the training schedule follow the table.)
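
The KITTI scene split quoted in the Dataset Splits row (scenes 0-16 / 17-18 / 19-20) reduces to a simple lookup. The sketch below only illustrates that split; the constant and helper names are ours, not from the V2B repository.

```python
# Minimal sketch of the KITTI tracking split quoted above.
# The scene IDs come from the paper; the names here are illustrative only.
KITTI_SCENE_SPLITS = {
    "train": list(range(0, 17)),  # scenes 0-16
    "val": [17, 18],              # scenes 17-18
    "test": [19, 20],             # scenes 19-20
}

def scenes_for(split: str) -> list:
    """Return the KITTI tracking scene IDs used for the given split."""
    return KITTI_SCENE_SPLITS[split]
```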
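
The Experiment Setup row describes a feature-aggregation stack of four 3D convolutions (strides 2, 1, 2, 1 along the z-axis) followed by four 2D convolutions (strides 2, 1, 1, 2) with skip connections. The PyTorch sketch below mirrors only those stride patterns; channel widths, kernel sizes, the z-collapse, and the omitted skip-connection wiring are our assumptions and are not taken from the authors' code.

```python
import torch
import torch.nn as nn

class VoxelToBEVSketch(nn.Module):
    """Illustrative stand-in for the feature-aggregation stack described above.

    Four 3D convolutions stride only along the z-axis (2, 1, 2, 1), the z-dimension
    is then collapsed into a bird's-eye-view (BEV) map, and four 2D convolutions
    (strides 2, 1, 1, 2) refine it. The skip connections mentioned in the paper are
    omitted here for brevity.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv3d = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=3, stride=(s, 1, 1), padding=1)
            for s in (2, 1, 2, 1)  # stride along z only (assumed layout B, C, Z, Y, X)
        )
        self.conv2d = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, stride=s, padding=1)
            for s in (2, 1, 1, 2)
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        x = voxels  # (B, C, Z, Y, X) voxelized point features
        for conv in self.conv3d:
            x = torch.relu(conv(x))
        bev = x.max(dim=2).values  # collapse z to obtain a BEV map of shape (B, C, Y, X)
        for conv in self.conv2d:
            bev = torch.relu(conv(bev))
        return bev
```

With these strides, the z-dimension and the x-y plane are each downsampled by a factor of 4 overall, matching the stride products quoted in the row above.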
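
The optimizer settings in the same row (Adam, learning rate 0.001, decay by a factor of 0.2 every 6 epochs, roughly 20 epochs to converge) map onto a standard PyTorch schedule. The sketch below assumes a PyTorch training loop; the model, data, and loss are placeholders rather than the authors' implementation.

```python
import torch

# Placeholder network standing in for the V2B tracker (the real model comes from
# https://github.com/fpthink/V2B); used here only so the optimizer has parameters.
model = torch.nn.Linear(3, 3)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.2)  # x0.2 every 6 epochs

for epoch in range(20):  # the paper reports about 20 epochs to convergence
    # Training loop omitted: iterate over template/search-area pairs,
    # compute the tracking losses, and call loss.backward() / optimizer.step().
    scheduler.step()
```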