Frozen CLIP Transformer Is an Efficient Point Cloud Encoder

Authors: Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, Wanli Ouyang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models.
Researcher Affiliation | Collaboration | 1) Shanghai AI Laboratory; 2) Jiangxi University of Finance and Economics; 3) University of Electronic Science and Technology of China; 4) Nanjing University of Science and Technology
Pseudocode | No | The paper includes mathematical formulations but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/XiaoshuiHuang/EPCL.
Open Datasets | Yes | Datasets: real-world detection on ScanNet (Dai et al. 2017), indoor semantic segmentation on S3DIS (Armeni et al. 2016), and outdoor semantic segmentation on SemanticKITTI (Behley et al. 2019). The paper also evaluates few-shot learning and classification accuracy on the synthetic ModelNet40 (Wu et al. 2015).
Dataset Splits | No | The paper mentions evaluating on the SemanticKITTI validation set (Table 3) and refers to S3DIS Area 5, which is a common split. However, it does not provide explicit split percentages (e.g., "80% training, 10% validation, 10% test") or absolute sample counts for the training, validation, or test sets in its experiments.
Hardware Specification | No | The paper does not report specific hardware details such as GPU models (e.g., NVIDIA A100, RTX 3090), CPU models, or cloud computing instance types used for the experiments.
Software Dependencies | No | The paper does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, other deep learning frameworks, or solvers).
Experiment Setup | No | The paper describes the general training strategy (freezing CLIP; finetuning the tokenizer, task token, and head) but does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or optimizer types. It notes that the task token is initialized as enumerated numbers and that a three-layer MLP is used to obtain point cloud token embeddings, but these are structural details rather than hyperparameters.
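The training strategy described above (a frozen CLIP transformer, with only the point tokenizer, task token, and task head finetuned) can be sketched as a parameter-selection step. The sketch below is illustrative only, not the authors' code: the component names (`tokenizer`, `task_token`, `clip_transformer`, `head`) are assumptions standing in for EPCL's actual module names.

```python
# Minimal sketch (not the authors' implementation) of EPCL-style training:
# parameters under the pretrained CLIP transformer are frozen, and only the
# point tokenizer, task token, and task head receive gradient updates.
# All module/parameter names below are hypothetical.

def trainable_parameters(param_names):
    """Return the subset of parameter names that would be optimized."""
    frozen_prefixes = ("clip_transformer.",)  # frozen pretrained 2D backbone
    return [name for name in param_names
            if not name.startswith(frozen_prefixes)]

# Hypothetical registry mimicking a named_parameters() listing of the model.
epcl_params = [
    "tokenizer.mlp.0.weight",               # three-layer MLP tokenizer (finetuned)
    "tokenizer.mlp.1.weight",
    "tokenizer.mlp.2.weight",
    "task_token.embedding",                 # task token (finetuned)
    "clip_transformer.block0.attn.weight",  # frozen CLIP weight
    "clip_transformer.block0.mlp.weight",   # frozen CLIP weight
    "head.seg.weight",                      # task-specific head (finetuned)
]

print(trainable_parameters(epcl_params))
```

In a real framework, the same effect would typically be achieved by disabling gradients on the frozen backbone and passing only the remaining parameters to the optimizer.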