Point Transformer V2: Grouped Vector Attention and Partition-based Pooling

Authors: Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, Hengshuang Zhao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our model achieves better performance than its predecessor and achieves state-of-the-art on several challenging 3D point cloud understanding benchmarks, including 3D point cloud segmentation on ScanNet v2 and S3DIS and 3D point cloud classification on ModelNet40. We conducted extensive analysis and controlled experiments to validate our designs. Our results indicate that PTv2 outperforms predecessor works and sets the new state-of-the-art on various 3D understanding tasks. From Section 4 (Experiments): To validate the effectiveness of the proposed method, we conduct experimental evaluations on ScanNet v2 [44] and S3DIS [45] for semantic segmentation, and ModelNet40 [46] for shape classification.
Researcher Affiliation | Collaboration | Xiaoyang Wu (1), Yixing Lao (2), Li Jiang (3), Xihui Liu (1), Hengshuang Zhao (1); affiliations: (1) The University of Hong Kong, (2) Intel Labs, (3) Max Planck Institute. Contact: {xywu3, hszhao}@cs.hku.hk
Pseudocode | No | The paper describes its methods textually and with diagrams (e.g., Figure 1, Figure 2) but does not include explicit pseudocode or algorithm blocks.
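Since the paper provides no pseudocode, the following is a minimal PyTorch sketch of grouped vector attention as suggested by the title and the paper's textual description: the query-key relation (plus a positional term) is encoded into one scalar weight per channel group rather than per channel. The module structure, layer names, and the pre-gathered (N, K, C) neighbor convention are our assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class GroupedVectorAttention(nn.Module):
    """Illustrative sketch only; not the authors' implementation."""

    def __init__(self, channels: int, groups: int):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)
        # Encodes the C-dim relation into one weight per group (assumed MLP shape).
        self.weight_encoding = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, groups),
        )

    def forward(self, feat, neighbor_feat, pos_enc):
        # feat: (N, C); neighbor_feat: (N, K, C); pos_enc: (N, K, C)
        q = self.to_q(feat).unsqueeze(1)         # (N, 1, C)
        k = self.to_k(neighbor_feat)             # (N, K, C)
        v = self.to_v(neighbor_feat) + pos_enc   # (N, K, C)
        relation = q - k + pos_enc               # subtraction relation, as in vector attention
        w = self.weight_encoding(relation).softmax(dim=1)   # (N, K, G), normalized over neighbors
        n, k_, c = v.shape
        v = v.view(n, k_, self.groups, c // self.groups)     # split channels into G groups
        out = (w.unsqueeze(-1) * v).sum(dim=1)               # (N, G, C/G), group-shared weights
        return out.reshape(n, c)
```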
Open Source Code | Yes | Our code will be available at https://github.com/Gofinge/PointTransformerV2.
Open Datasets | Yes | To validate the effectiveness of the proposed method, we conduct experimental evaluations on ScanNet v2 [44] and S3DIS [45] for semantic segmentation, and ModelNet40 [46] for shape classification.
Dataset Splits | Yes | The ScanNet v2 dataset contains 1,513 room scans reconstructed from RGB-D frames, divided into 1,201 scenes for training and 312 for validation. For S3DIS, following a common protocol [36, 4, 1], Area 5 is withheld during training and used for testing. For ModelNet40, 9,843 models are used for training and the remaining 2,468 models are reserved for testing.
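For quick reference, the reported splits collected as a plain Python mapping; the structure is ours, while the numbers and the Area 5 protocol are quoted from the row above.

```python
# Dataset splits as reported in the paper (structure is illustrative).
dataset_splits = {
    "ScanNet v2": {"train_scenes": 1201, "val_scenes": 312},   # 1,513 scans total
    "S3DIS":      {"train": "all areas except Area 5", "test": "Area 5"},
    "ModelNet40": {"train_models": 9843, "test_models": 2468},
}
```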
Hardware Specification | Yes | We record the amortized forward time for each scan in the ScanNet v2 validation set with batch size 4 on a single TITAN RTX.
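A minimal sketch of how such an amortized forward-time measurement is commonly done in PyTorch; `model` and `val_loader` are hypothetical stand-ins, and the paper does not disclose its timing code.

```python
import time
import torch

@torch.no_grad()
def amortized_forward_time(model, val_loader, scans_per_batch=4, device="cuda"):
    """Average forward seconds per scan, amortized over batches of 4 scans."""
    model.eval().to(device)
    num_scans, total = 0, 0.0
    for batch in val_loader:            # assumes the batch object supports .to(device)
        batch = batch.to(device)
        torch.cuda.synchronize()        # exclude queued async work from the timer
        start = time.perf_counter()
        model(batch)
        torch.cuda.synchronize()        # wait for the forward pass to finish
        total += time.perf_counter() - start
        num_scans += scans_per_batch
    return total / num_scans
```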
Software Dependencies | No | The paper notes that "implementation details are available in the appendix" but does not provide specific software dependencies with version numbers in the provided text.
Experiment Setup | Yes | Backbone structure. Following previous works [18, 1], we adopt a U-Net architecture with skip connections. There are four stages of encoders and decoders with block depths [2, 2, 6, 2] and [1, 1, 1, 1], respectively. The grid size multipliers for the four stages are [×3.0, ×2.5, ×2.5, ×2.5], representing the expansion ratio over the previous pooling stage. The initial feature dimension is 48, and we first embed the input channels to this dimension with a basic block with 6 attention groups. We then double the feature dimension and the number of attention groups each time we enter the next encoding stage. For the four encoding stages, the feature dimensions are [96, 192, 384, 384], and the corresponding attention groups are [12, 24, 48, 48].
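The hyperparameters above can be collected into a single configuration sketch; the schema and field names below are our own invention for illustration, not the authors' config format.

```python
# Hypothetical configuration schema; field names are ours, values are quoted
# from the experiment-setup description above.
ptv2_backbone = dict(
    embed_dim=48,                    # initial feature dimension after input embedding
    embed_groups=6,                  # attention groups in the embedding basic block
    enc_depths=(2, 2, 6, 2),         # encoder block depths for the four stages
    dec_depths=(1, 1, 1, 1),         # decoder block depths for the four stages
    enc_dims=(96, 192, 384, 384),    # feature dimension per encoding stage
    enc_groups=(12, 24, 48, 48),     # attention groups per encoding stage
    grid_multipliers=(3.0, 2.5, 2.5, 2.5),  # grid-size expansion over the previous pooling stage
)
```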