OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding
Authors: Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Jian Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including open vocabulary-based 3D object selection, 3D point cloud understanding, click-based 3D object selection, and ablation studies, demonstrate the effectiveness of our proposed method. |
| Researcher Affiliation | Collaboration | 1 School of Electronic and Computer Engineering, Peking University; 2 Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; 3 Baidu VIS; 4 Beihang University |
| Pseudocode | No | The paper describes processes and uses mathematical formulas but does not include explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | The source code is available at our project page https://3d-aigc.github.io/OpenGaussian. |
| Open Datasets | Yes | We conducted experiments on the LeRF-OVS dataset re-annotated by LangSplat. The average IoU and accuracy are calculated between the images rendered from the 3D Gaussian points selected by the text query and the GT object masks. We conduct comparisons on the ScanNetv2 dataset [9]. (A minimal IoU/accuracy computation in this style is sketched after the table.) |
| Dataset Splits | No | The paper mentions using specific datasets (LeRF-OVS, ScanNet) and randomly selected scenes for evaluation but does not explicitly detail the train/validation/test splits used for the models themselves. |
| Hardware Specification | Yes | We train each scene on a single 32G V100 GPU (with actual memory usage around 16 to 20G). |
| Software Dependencies | No | The paper mentions using components like CLIP, SAM, DINO, and LSeg but does not specify their version numbers or the versions of broader software frameworks (e.g., Python, PyTorch). |
| Experiment Setup | Yes | Consistent with LangSplat, we first pre-train the standard 3DGS for 30,000 steps. Subsequently, we freeze the Gaussian coordinates, scale, and opacity parameters, and train the instance features for 10,000 steps (ScanNet is 20,000 steps) and the two-layer codebook for 30,000 steps (ScanNet is 40,000 steps). ... (4) Hyperparameters. 1) The values of k in the two-level codebook. In the ScanNet dataset, k1 = 64, k2 = 5 are used uniformly. In the LeRF dataset, for the teatime scene, k1 = 32, k2 = 10; for the other scenes, k1 = 64, k2 = 10. 2) The weights of the coordinates in the coarse-level codebook. In the ScanNet dataset, the weight is 1.0. In the LeRF dataset, the weight for the teatime scene is 0.1, while for the other scenes, the weight is 0.5. 3) The weight of the intra-mask smoothing loss. In the ramen scene of LeRF, the weight is 0.01; for the other scenes and ScanNet, the weight is 0.1. (These settings are gathered into the configuration sketch after the table.) |
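
For the evaluation protocol quoted in the Open Datasets row, here is a minimal sketch of the mask IoU and accuracy computation, assuming binary masks. The paper does not spell out its exact accuracy definition, so the per-pixel accuracy below is our assumption, and `mask_iou_and_acc` is an illustrative name, not a function from the released code.

```python
import numpy as np

def mask_iou_and_acc(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """IoU and per-pixel accuracy between two binary (H, W) masks.

    `pred_mask` would be the rendering of the 3D Gaussians selected by a
    text query; `gt_mask` is the annotated GT object mask.
    """
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union > 0 else 0.0
    acc = (pred == gt).mean()  # per-pixel accuracy (our assumed definition)
    return iou, acc

# Toy check with random 64x64 masks; the reported numbers average over
# all query/view pairs in a scene.
rng = np.random.default_rng(0)
pred = rng.random((64, 64)) > 0.5
gt = rng.random((64, 64)) > 0.5
iou, acc = mask_iou_and_acc(pred, gt)
print(f"IoU={iou:.3f}  Acc={acc:.3f}")
```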
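
To make the quoted schedule and hyperparameters easier to scan, the following configuration sketch collects them per dataset. The dictionary names and keys are our own illustration, not identifiers from the released code; only the numeric values come from the paper.

```python
# Training schedule and hyperparameters as quoted in the Experiment Setup row.
SCANNET = dict(
    pretrain_3dgs_steps=30_000,     # standard 3DGS pre-training
    instance_feature_steps=20_000,  # xyz/scale/opacity frozen during this stage
    codebook_steps=40_000,          # two-level codebook training
    k_coarse=64, k_fine=5,          # codebook sizes (k1, k2)
    coord_weight=1.0,               # xyz weight in the coarse-level codebook
    smooth_loss_weight=0.1,         # intra-mask smoothing loss
)

LERF_DEFAULT = dict(
    pretrain_3dgs_steps=30_000,
    instance_feature_steps=10_000,
    codebook_steps=30_000,
    k_coarse=64, k_fine=10,
    coord_weight=0.5,
    smooth_loss_weight=0.1,
)

# Scene-specific overrides reported in the paper.
LERF_TEATIME = {**LERF_DEFAULT, "k_coarse": 32, "coord_weight": 0.1}
LERF_RAMEN = {**LERF_DEFAULT, "smooth_loss_weight": 0.01}
```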
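
The two-level codebook itself is only summarized in the table, so the sketch below illustrates one plausible reading of the coarse-to-fine discretization: the coarse level groups Gaussians on instance features concatenated with weighted xyz coordinates (the "weight of the coordinates" above), and the fine level re-groups features within each coarse cluster. k-means stands in for the paper's learned codebook, so this is an assumption-laden approximation, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_level_codebook(features, xyz, k1=64, k2=5, coord_weight=1.0):
    """Coarse-to-fine discretization of per-Gaussian instance features.

    Coarse level: cluster on [feature, coord_weight * xyz], so spatially
    distant Gaussians with similar features can still be separated.
    Fine level: re-cluster features within each coarse cluster.
    k-means stands in for the paper's learned codebook (an assumption).
    """
    coarse_in = np.concatenate([features, coord_weight * xyz], axis=1)
    coarse_ids = KMeans(n_clusters=k1, n_init=10).fit_predict(coarse_in)
    fine_ids = np.zeros(len(features), dtype=int)
    for c in range(k1):
        idx = np.flatnonzero(coarse_ids == c)
        if idx.size == 0:
            continue
        k = min(k2, idx.size)  # guard against clusters smaller than k2
        fine_ids[idx] = KMeans(n_clusters=k, n_init=10).fit_predict(features[idx])
    return coarse_ids, fine_ids

# Toy run: 2,000 Gaussians with 6-D instance features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 6)).astype(np.float32)
xyz = rng.uniform(-1.0, 1.0, size=(2000, 3)).astype(np.float32)
coarse, fine = two_level_codebook(feats, xyz, k1=8, k2=3, coord_weight=0.5)
print(coarse.shape, fine.max())
```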