UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Authors: Qingdong He, Jinlong Peng, Zhengkai Jiang, Kai Wu, Xiaozhong Ji, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Mingang Chen, Yunsheng Wu
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes. |
| Researcher Affiliation | Industry | YouTu Lab, Tencent; Shanghai Development Center of Computer Software Technology. {yingcaihe, jeromepeng, zhengkjiang, lloydwu, xiaozhongji, vtzhang, caseywang, jasoncjwang, simonwu}@tencent.com, cmg@sscenter.sh.cn |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/hithqd/UniM-OV3D. |
| Open Datasets | Yes | To validate the effectiveness of our proposed UniM-OV3D, we conduct extensive experiments on four popular public 3D benchmarks: ScanNet [Dai et al., 2017], ScanNet200 [Rozenberszki et al., 2022], S3DIS [Armeni et al., 2016] and nuScenes [Caesar et al., 2020]. |
| Dataset Splits | Yes | Following [Ding et al., 2023b; Yang et al., 2023], we disregard the other-furniture class in ScanNet and randomly partition the remaining 19 classes into 3 base/novel partitions, i.e. B15/N4 (15 base and 4 novel categories), B12/N7 and B10/N9, for semantic segmentation. Following SoftGroup [Vu et al., 2022] to exclude two background classes, we acquire B13/N4, B10/N7, and B8/N9 partitions for instance segmentation on ScanNet. Similarly, we ignore the clutter class in S3DIS and get B8/N4, B6/N6 for both semantic and instance segmentation. For ScanNet200, we split 200 classes to B170/N30 and B150/N50. For nuScenes, we drop the other-flat class and obtain B12/N3 and B10/N5. (A minimal partition sketch follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions various pre-trained models and frameworks (e.g., CLIP, DINO, Point-Bind, UNet) that were used or built upon, but it does not specify version numbers for any software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | The final caption loss is a weighted combination between different views as follows: $L_{\mathrm{capt}}^{\mathrm{total}} = \alpha L_{\mathrm{capt}}^{\mathrm{global}} + \beta L_{\mathrm{capt}}^{\mathrm{eye}} + \gamma L_{\mathrm{capt}}^{\mathrm{sector}}$, where $\alpha$, $\beta$ and $\gamma$ are used to balance the relative importance of different parts and are set to 1, 0.8, 0.8 by default. ... And $\epsilon$ is a learnable temperature parameter, similar to CLIP [Radford et al., 2021]. (A loss-combination sketch follows the table.) |
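The base/novel split described in the Dataset Splits row is a simple random partition of class names. Below is a minimal Python sketch of that step, assuming the standard 20-class ScanNet label set; the helper `make_partition`, the fixed seed, and the ignored-class handling are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch (not the authors' code): drop "otherfurniture" from the
# 20 ScanNet evaluation classes and randomly split the remaining 19 into base
# and novel sets, e.g. B15/N4 (15 base, 4 novel).
import random

SCANNET_CLASSES = [
    "wall", "floor", "cabinet", "bed", "chair", "sofa", "table", "door",
    "window", "bookshelf", "picture", "counter", "desk", "curtain",
    "refrigerator", "shower curtain", "toilet", "sink", "bathtub",
    "otherfurniture",
]

def make_partition(classes, num_novel, ignored=("otherfurniture",), seed=0):
    """Randomly split `classes` (minus `ignored`) into base/novel class lists."""
    kept = [c for c in classes if c not in ignored]
    rng = random.Random(seed)                     # hypothetical fixed seed
    novel = sorted(rng.sample(kept, num_novel))   # novel (unseen) categories
    base = [c for c in kept if c not in novel]    # base (seen) categories
    return base, novel

base, novel = make_partition(SCANNET_CLASSES, num_novel=4)  # B15/N4
print(len(base), len(novel))                                # 15 4
```

The same helper would give B12/N7 or B10/N9 by changing `num_novel`, and the S3DIS, ScanNet200 and nuScenes partitions by swapping in the corresponding class lists.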
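The caption-loss equation quoted in the Experiment Setup row is a plain weighted sum of the global-, eye- and sector-view caption losses. Below is a minimal PyTorch-style sketch under that reading; the function name and signature are hypothetical, and only the default weights (1, 0.8, 0.8) come from the paper.

```python
# Hypothetical helper reproducing the quoted combination
#   L_capt^total = alpha * L_capt^global + beta * L_capt^eye + gamma * L_capt^sector
# with the paper's default weights alpha=1, beta=0.8, gamma=0.8.
import torch

def total_caption_loss(loss_global: torch.Tensor,
                       loss_eye: torch.Tensor,
                       loss_sector: torch.Tensor,
                       alpha: float = 1.0,
                       beta: float = 0.8,
                       gamma: float = 0.8) -> torch.Tensor:
    """Balance the global-, eye- and sector-view caption losses."""
    return alpha * loss_global + beta * loss_eye + gamma * loss_sector
```

The learnable temperature $\epsilon$ mentioned alongside this loss would, as in CLIP, typically be a scalar `torch.nn.Parameter` that scales the contrastive logits before the softmax; it is independent of the weighted sum above.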