OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Authors: Ayça Takmaz, Elisabetta Fedele, Robert Sumner, Marc Pollefeys, Federico Tombari, Francis Engelmann

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.
Researcher Affiliation | Collaboration | Ayça Takmaz¹, Elisabetta Fedele¹, Robert W. Sumner¹, Marc Pollefeys¹٬², Federico Tombari³, Francis Engelmann¹٬³ (¹ETH Zürich, ²Microsoft, ³Google)
Pseudocode | Yes | Algorithm 1: 2D mask selection algorithm
Open Source Code | Yes | openmask3d.github.io
Open Datasets | Yes | We conduct our experiments using the ScanNet200 [57] and Replica [61] datasets.
Dataset Splits | Yes | We report our ScanNet200 results on the validation set consisting of 312 scenes, and evaluate for the 3D instance segmentation task using the closed vocabulary of 200 categories from the ScanNet200 annotations.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are provided for the experiments; the paper only mentions "a single GPU" when reporting computation time.
Software Dependencies | No | The paper mentions using specific models and tools such as CLIP [55], SAM [36], and Mask3D [58], but does not provide version numbers for these or any other ancillary software.
Experiment Setup | Yes | OpenMask3D implementation details: We use posed RGB-depth pairs for both the ScanNet200 and Replica datasets, and we process 1 frame in every 10 frames in the RGB-D sequences. In order to compute image features on the mask crops, we use the CLIP [55] visual encoder from the ViT-L/14 model pre-trained at a 336-pixel resolution, which has a feature dimensionality of 768. For the visibility score computation, we use k_threshold = 0.2, and for top-view selection we use k_view = 5. In all experiments with multi-scale crops, we use L = 3 levels. In the 2D mask selection algorithm based on SAM [36], we repeat the process for k_rounds = 10 rounds, and sample k_sample = 5 points at each iteration.
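The experiment setup above outlines a per-instance pipeline: select the views where a 3D instance is best visible, refine its 2D mask with SAM in those views, then encode multi-scale crops with CLIP. The sketches below illustrate each step under stated assumptions; they are not the authors' code. First, a minimal sketch of visibility scoring and top-view selection (k_threshold = 0.2, k_view = 5), assuming the score for a (mask, frame) pair is the number of the mask's 3D points that project inside the image and pass a depth-occlusion test, normalized by the best-covered frame; the pose convention and depth tolerance `depth_eps` are assumptions.

```python
# Sketch: visibility score and top-view selection for one 3D instance mask.
import numpy as np

def visible_point_count(points, pose_w2c, K, depth, depth_eps=0.05):
    """Count how many of the mask's 3D points are visible in one frame.
    `pose_w2c` is a 4x4 world-to-camera matrix, `K` the 3x3 intrinsics,
    `depth` the frame's depth map (depth_eps is an assumed tolerance)."""
    cam = (pose_w2c[:3, :3] @ points.T + pose_w2c[:3, 3:4]).T  # world -> camera
    in_front = cam[:, 2] > 0
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                # perspective divide
    h, w = depth.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], cam[inside, 2]
    unoccluded = np.abs(depth[v, u] - z) < depth_eps           # occlusion test
    return int(unoccluded.sum())

def top_views(points, frames, k_threshold=0.2, k_view=5):
    """Return indices of the (at most) k_view frames whose normalized
    visibility score exceeds k_threshold."""
    counts = np.array([visible_point_count(points, f["pose"], f["K"], f["depth"])
                       for f in frames])
    scores = counts / max(counts.max(), 1)                     # normalize by best view
    candidates = np.flatnonzero(scores > k_threshold)
    return candidates[np.argsort(scores[candidates])[::-1][:k_view]]
```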
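Next, a sketch of the SAM-based 2D mask selection loop (Algorithm 1 in the paper): in each of k_rounds = 10 rounds, sample k_sample = 5 point prompts from the instance's projected points, query SAM, and keep the best mask. `SamPredictor` is the real API from the segment-anything repository; selecting by SAM's predicted quality score is an assumption about the exact criterion.

```python
# Sketch: SAM-based 2D mask selection for one (instance, view) pair.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def select_2d_mask(image, projected_points, k_rounds=10, k_sample=5, seed=0):
    """Pick the best SAM mask using repeated random point prompts.
    `projected_points` holds the instance's visible pixels as (x, y)."""
    rng = np.random.default_rng(seed)
    predictor.set_image(image)                 # HxWx3 uint8 RGB image
    best_mask, best_score = None, -np.inf
    for _ in range(k_rounds):
        idx = rng.choice(len(projected_points),
                         size=min(k_sample, len(projected_points)),
                         replace=False)
        point_coords = projected_points[idx].astype(np.float32)
        point_labels = np.ones(len(idx), dtype=int)   # all foreground prompts
        masks, scores, _ = predictor.predict(
            point_coords=point_coords,
            point_labels=point_labels,
            multimask_output=False)
        if scores[0] > best_score:                    # keep highest-quality mask
            best_mask, best_score = masks[0], scores[0]
    return best_mask
```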
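Finally, a sketch of the multi-scale crop feature extraction with the CLIP ViT-L/14 visual encoder at 336-pixel resolution (L = 3 levels, 768-d output). The per-level crop-expansion factor and the mean aggregation across scales are assumptions, not the authors' exact implementation.

```python
# Sketch: multi-scale CLIP features for one refined 2D mask crop.
import torch
import clip  # OpenAI CLIP, https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def multiscale_crops(image, bbox, levels=3, expand=0.1):
    """Return `levels` crops around `bbox` = (x0, y0, x1, y1), each level
    enlarged by an assumed `expand` fraction of the box size."""
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0
    crops = []
    for level in range(levels):
        dx, dy = w * expand * level, h * expand * level
        crops.append(image.crop((max(0, x0 - dx), max(0, y0 - dy),
                                 min(image.width, x1 + dx),
                                 min(image.height, y1 + dy))))
    return crops

@torch.no_grad()
def mask_crop_feature(image, bbox):
    """Encode the multi-scale crops with CLIP and average them into a
    single normalized 768-d feature for this mask/view."""
    batch = torch.stack([preprocess(c) for c in multiscale_crops(image, bbox)])
    feats = model.encode_image(batch.to(device))      # (L, 768) for ViT-L/14
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize per crop
    return feats.mean(dim=0)                          # aggregate over scales
```

Averaging the per-scale features keeps the representation robust to how tightly the crop frames the object; in the full method the per-view features would additionally be aggregated across the selected top views.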