OpenMask3D: Open-Vocabulary 3D Instance Segmentation
Authors: Ayca Takmaz, Elisabetta Fedele, Robert Sumner, Marc Pollefeys, Federico Tombari, Francis Engelmann
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials. |
| Researcher Affiliation | Collaboration | Ayça Takmaz¹, Elisabetta Fedele¹, Robert W. Sumner¹, Marc Pollefeys¹,², Federico Tombari³, Francis Engelmann¹,³ (¹ETH Zürich, ²Microsoft, ³Google) |
| Pseudocode | Yes | Algorithm 1: 2D mask selection algorithm (a hedged code sketch follows the table) |
| Open Source Code | Yes | openmask3d.github.io |
| Open Datasets | Yes | We conduct our experiments using the ScanNet200 [57] and Replica [61] datasets. |
| Dataset Splits | Yes | We report our ScanNet200 results on the validation set consisting of 312 scenes, and evaluate for the 3D instance segmentation task using the closed vocabulary of 200 categories from the ScanNet200 annotations. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided. The paper only mentions "a single GPU" for computation time. |
| Software Dependencies | No | The paper mentions using specific models and tools like CLIP [55], SAM [36], and Mask3D [58] but does not provide specific version numbers for these software components or any other ancillary software. |
| Experiment Setup | Yes | OpenMask3D implementation details. We use posed RGB-depth pairs for both the ScanNet200 and Replica datasets, and we process 1 frame in every 10 frames in the RGB-D sequences. To compute image features on the mask crops, we use the CLIP [55] visual encoder from the ViT-L/14 model pre-trained at a 336-pixel resolution, which has a feature dimensionality of 768. For the visibility score computation, we use k_threshold = 0.2, and for top-view selection we use k_view = 5. In all experiments with multi-scale crops, we use L = 3 levels. In the 2D mask selection algorithm based on SAM [36], we repeat the process for k_rounds = 10 rounds, and sample k_sample = 5 points at each iteration. (Both settings are sketched in code below.) |
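
The experiment-setup row translates naturally into a small configuration object plus a multi-scale CLIP crop encoder. Below is a minimal sketch, assuming the OpenAI `clip` package; the names `OpenMask3DConfig` and `multiscale_crop_feature`, and the per-level crop enlargement factor, are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass

import clip   # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image


@dataclass
class OpenMask3DConfig:
    frame_stride: int = 10    # process 1 in every 10 RGB-D frames
    k_threshold: float = 0.2  # visibility-score threshold
    k_view: int = 5           # number of top views kept per 3D mask
    num_levels: int = 3       # L: multi-scale crop levels
    k_rounds: int = 10        # SAM point-sampling rounds (Algorithm 1)
    k_sample: int = 5         # points sampled per round (Algorithm 1)


device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-L/14 pre-trained at 336 px; encode_image yields 768-d features.
model, preprocess = clip.load("ViT-L/14@336px", device=device)


def multiscale_crop_feature(image: Image.Image, bbox, cfg: OpenMask3DConfig) -> torch.Tensor:
    """Average CLIP features over L progressively enlarged crops around a mask's 2D bounding box."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    feats = []
    for level in range(cfg.num_levels):
        scale = 1.0 + 0.5 * level  # ASSUMED enlargement per level; not specified in the row above
        half_w, half_h = scale * w / 2.0, scale * h / 2.0
        crop = image.crop((
            int(max(0, cx - half_w)), int(max(0, cy - half_h)),
            int(min(image.width, cx + half_w)), int(min(image.height, cy + half_h)),
        ))
        with torch.no_grad():
            feats.append(model.encode_image(preprocess(crop).unsqueeze(0).to(device)))
    feat = torch.cat(feats).mean(dim=0)
    return feat / feat.norm()  # unit-normalised 768-d descriptor
```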
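
The pseudocode row points to Algorithm 1, the SAM-based 2D mask selection. A minimal sketch follows, assuming the `segment_anything` package; the helper name `select_2d_mask`, the checkpoint path, and the per-round bookkeeping (keep the best-scoring mask across rounds) are assumptions layered on the stated k_rounds = 10 rounds of k_sample = 5 sampled points, not the authors' code.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path is a placeholder; download the official SAM weights separately.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)


def select_2d_mask(image: np.ndarray, projected_points: np.ndarray,
                   k_rounds: int = 10, k_sample: int = 5, seed: int = 0) -> np.ndarray:
    """Pick the highest-scoring SAM mask over k_rounds rounds of k_sample point prompts.

    projected_points: (N, 2) pixel coordinates of the 3D instance visible in this view.
    """
    rng = np.random.default_rng(seed)
    predictor.set_image(image)  # image: HxWx3 uint8, RGB
    best_mask, best_score = None, -np.inf
    for _ in range(k_rounds):
        idx = rng.choice(len(projected_points),
                         size=min(k_sample, len(projected_points)), replace=False)
        coords = projected_points[idx].astype(np.float32)
        labels = np.ones(len(idx), dtype=np.int32)  # all prompts are foreground points
        masks, scores, _ = predictor.predict(
            point_coords=coords, point_labels=labels, multimask_output=True)
        if scores.max() > best_score:
            best_score = float(scores.max())
            best_mask = masks[scores.argmax()]
    return best_mask
```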