XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Authors: Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, Jiwen Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on multiple benchmarks of different datasets, including the ScanNet20 [9], ScanNet200 [38], and S3DIS [1] datasets, to evaluate the effectiveness of our proposed method. XMask3D demonstrates competitive performance across all benchmarks. Additionally, we perform thorough ablation studies and provide intuitive visualizations to showcase the contribution of each proposed mask-level technique.
Researcher Affiliation | Academia | Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, Jiwen Lu; Department of Automation, Tsinghua University, China. {wziyi22, wyb23, yuxm20}@mails.tsinghua.edu.cn; {jzhou, lujiwen}@tsinghua.edu.cn
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It provides architectural diagrams and mathematical equations but no step-by-step algorithmic descriptions.
Open Source Code | Yes | Code is available at https://github.com/wangzy22/XMask3D.
Open Datasets | Yes | In accordance with prior literature, our research conducts experimentation on two prominent indoor scene datasets: ScanNet [9] and S3DIS [1].
Dataset Splits | Yes | ScanNet, a foundational dataset in this domain, comprises 1201 scenes allocated for training and 312 scenes designated for validation. Each scene within ScanNet furnishes point cloud data, multi-view images, and corresponding camera pose matrices. Similarly, S3DIS offers analogous data modalities, encompassing 271 rooms across six distinct indoor environments. Conforming to established conventions, we reserve Area 5 of S3DIS for validation purposes, ensuring consistency with prior methodologies. (See the split sketch after the table.)
Hardware Specification | Yes | The training regimen for the XMask3D model involves utilizing the AdamW optimizer [29] with a cosine learning rate scheduler. We train the model for 150 epochs on 4 NVIDIA A800 GPUs, employing a batch size of 64. (See the training-loop sketch after the table.)
Software Dependencies | No | The paper mentions various software components and models such as MinkUNet, ODISE, CLIP, Mask2Former, the AdamW optimizer, and ViT-GPT2, but it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | The training regimen for the XMask3D model involves utilizing the AdamW optimizer [29] with a cosine learning rate scheduler. We train the model for 150 epochs on 4 NVIDIA A800 GPUs, employing a batch size of 64. Notably, we introduce mask-level regularization to the training pipeline after the initial 50 epochs. For all benchmarks, we set the same ω_seg = 4, ω_3d-view = 1, ω_2d-view = 4, ω_fuse-view = 1.5 as the hyper-parameter choices. The ω_mask and ω_bi are set differently across benchmarks: ω_mask = 0.5/0.5/1/2/2/1/1 and ω_bi = 16/12/8/48/32/20/15 for the ScanNet B15/N4, B12/N7, B10/N9, ScanNet200 B170/N30, B150/N50, and S3DIS B8/N4, B6/N6 benchmarks, respectively. (See the loss-weight sketch after the table.)
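
The dataset splits quoted in the "Dataset Splits" row can be summarized as a small configuration. This is a minimal sketch only; the dictionary layout and key names are hypothetical and are not taken from the XMask3D repository.

```python
# Hypothetical summary of the splits quoted above; key names are assumptions.
DATASET_SPLITS = {
    "ScanNet": {
        "train_scenes": 1201,   # scenes allocated for training
        "val_scenes": 312,      # scenes designated for validation
        "modalities": ["point_cloud", "multi_view_images", "camera_poses"],
    },
    "S3DIS": {
        "num_rooms": 271,       # rooms across six distinct indoor areas
        "val_split": "Area_5",  # Area 5 reserved for validation, per convention
        "train_splits": ["Area_1", "Area_2", "Area_3", "Area_4", "Area_6"],
    },
}
```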
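The training regimen in the "Hardware Specification" and "Experiment Setup" rows (AdamW, cosine schedule, 150 epochs, batch size 64, mask-level regularization enabled after 50 epochs) can be sketched in PyTorch as below. The stand-in model, learning rate, and weight decay are placeholders, since the excerpt does not report their values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in network; the real model is the XMask3D architecture.
model = torch.nn.Linear(96, 20)

# lr and weight_decay are placeholder values, not reported in the excerpt.
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = CosineAnnealingLR(optimizer, T_max=150)  # cosine schedule, stepped once per epoch

for epoch in range(150):
    # One epoch over batches of size 64 (4 x NVIDIA A800 in the reported setup).
    optimizer.zero_grad()
    loss = model(torch.randn(64, 96)).sum()  # dummy forward pass standing in for the task losses
    if epoch >= 50:
        # Mask-level regularization terms would be added to the loss here,
        # mirroring the "after the initial 50 epochs" schedule in the excerpt.
        pass
    loss.backward()
    optimizer.step()
    scheduler.step()
```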
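The per-benchmark loss weights quoted in the "Experiment Setup" row can likewise be collected into one place. The key names and the weighted-sum combination below are assumptions; the excerpt only lists the numerical values, not how the terms are combined.

```python
# Loss weights shared across all benchmarks (values from the excerpt above).
SHARED_WEIGHTS = {"seg": 4.0, "3d_view": 1.0, "2d_view": 4.0, "fuse_view": 1.5}

# (omega_mask, omega_bi) per benchmark; the benchmark keys are hypothetical identifiers.
BENCHMARK_WEIGHTS = {
    "scannet_b15_n4":      (0.5, 16),
    "scannet_b12_n7":      (0.5, 12),
    "scannet_b10_n9":      (1.0, 8),
    "scannet200_b170_n30": (2.0, 48),
    "scannet200_b150_n50": (2.0, 32),
    "s3dis_b8_n4":         (1.0, 20),
    "s3dis_b6_n6":         (1.0, 15),
}

def total_loss(losses: dict, benchmark: str) -> float:
    """Assumed weighted sum of the individual loss terms for one benchmark."""
    w_mask, w_bi = BENCHMARK_WEIGHTS[benchmark]
    weighted = sum(w * losses[name] for name, w in SHARED_WEIGHTS.items())
    return weighted + w_mask * losses["mask"] + w_bi * losses["bi"]
```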