Weakly Supervised 3D Open-vocabulary Segmentation
Authors: Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs." and, from Section 4 (Experiments): "We evaluate our method on 3D open-vocabulary segmentation, showing that our method can recognize long-tail classes and produce highly accurate object boundaries even with limited input data. We employ TensoRF [42] as the backbone and extract 3 scales of pixel-level CLIP features. More implementation details and experiments are in the appendix." |
| Researcher Affiliation | Academia | Kunhao Liu (1), Fangneng Zhan (2), Jiahui Zhang (1), Muyu Xu (1), Yingchen Yu (1), Abdulmotaleb El Saddik (3,5), Christian Theobalt (2), Eric Xing (4,5), Shijian Lu (1); affiliations: (1) Nanyang Technological University, (2) Max Planck Institute for Informatics, (3) University of Ottawa, (4) Carnegie Mellon University, (5) MBZUAI |
| Pseudocode | Yes | Algorithm 1: Extracting pixel-level features of an image from CLIP (a hedged sketch of this multi-scale extraction appears after the table). |
| Open Source Code | Yes | Code is available at https://github.com/Kunhao-Liu/3D-OVS. |
| Open Datasets | No | The paper states, "Thus following [2], we create a dataset comprising 10 distinct scenes," and describes how it was collected: "We capture 10 scenes using smartphones and use Colmap [75] to extract camera parameters for each image" (a sketch of this COLMAP step appears after the table). However, it does not provide any direct link, DOI, repository name, or citation that gives concrete public access information for *their created dataset*. |
| Dataset Splits | No | The paper mentions "Ground truth masks for the test views are manually annotated" and "We manually annotate the segmentation maps of 5 views for each scene as the ground truth for evaluation." However, it does not provide specific details on training/validation/test splits, such as percentages or sample counts for each split. |
| Hardware Specification | Yes | The model is trained on an NVIDIA A5000 GPU with 24G memory for 1h30min for each scene. |
| Software Dependencies | No | The paper mentions using "TensoRF [42]", the "ViT-B/16 CLIP model", the "version 1 dino_vitb8 model", and "Colmap [75]". While some models carry a version (like "version 1 dino_vitb8"), no specific software versions for frameworks (e.g., PyTorch 1.9) or exact versions of the other mentioned software/models are provided. |
| Experiment Setup | Yes | "We set τ = 0.2 to get the sharper segmentation probability distribution P. The offset b is set to 0.7 to measure the similarities of the DINO features... We use 3 scales of CLIP features, and the patch sizes of each scale are set as s/5, s/7, and s/10... The weights associated with similar and dissimilar DINO features in L_FDA are set as λ_pos = 200 and λ_neg = 0.2 by default." and "For segmentation training, we train the model for 15k iterations. In the first 5k iterations, we freeze the shared volume and density volume, and train the selection volume and the CLIP feature branch. For the rest 10k iterations, we further finetune the shared volume and the RGB branch. We use the Adam optimizer with betas = (0.9, 0.99). The learning rates for training the volume and MLP branch are respectively set to 0.02 and 1e-4. For finetuning the volume and the MLP, the learning rates are set to 5e-3 and 5e-5. We also employ a learning rate decay with a factor of 0.1." and "When computing L_supervision and L_RDA, we randomly sample rays with a batch size of 4096. When computing L_FDA we randomly sample patches of size 256×256 with a batch size of 8." (A hedged sketch of this two-stage schedule appears after the table.) |
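The "Pseudocode" row points to Algorithm 1, which extracts pixel-level CLIP features by encoding image patches at three scales (patch sizes s/5, s/7, and s/10, per the experiment-setup row). The snippet below is a minimal sketch of that idea, assuming OpenAI's `clip` package and the ViT-B/16 backbone named in the paper: it tiles the image with non-overlapping patches and broadcasts each patch embedding to the patch's pixels. It is not the authors' exact Algorithm 1; the file path and the simple tiling strategy are illustrative assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 is the CLIP backbone named in the paper.
model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def pixel_level_clip_features(image: Image.Image, patch_size: int) -> torch.Tensor:
    """Encode non-overlapping patches with CLIP and broadcast each patch
    embedding to its pixels, giving an (H, W, D) pixel-level feature map."""
    w, h = image.size
    feats = torch.zeros(h, w, model.visual.output_dim, device=device)
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            box = (left, top, min(left + patch_size, w), min(top + patch_size, h))
            patch = preprocess(image.crop(box).convert("RGB")).unsqueeze(0).to(device)
            emb = model.encode_image(patch)             # (1, D) patch embedding
            emb = emb / emb.norm(dim=-1, keepdim=True)  # CLIP features are used normalized
            feats[box[1]:box[3], box[0]:box[2]] = emb   # broadcast to all pixels of the patch
    return feats

# Three scales with patch sizes s/5, s/7, s/10 (s = shorter image side), as in the paper.
# "scene/view_000.jpg" is a hypothetical path.
img = Image.open("scene/view_000.jpg")
s = min(img.size)
multi_scale_feats = [pixel_level_clip_features(img, max(1, s // k)) for k in (5, 7, 10)]
```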
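The "Open Datasets" row quotes the data-collection step "use Colmap [75] to extract camera parameters for each image". Below is a minimal sketch of that step using COLMAP's standard command-line pipeline driven from Python; the directory layout is a hypothetical assumption and this is not the authors' preprocessing script.

```python
import subprocess
from pathlib import Path

# Hypothetical scene directory containing the smartphone captures.
scene = Path("scenes/my_scene")
db, images, sparse = scene / "database.db", scene / "images", scene / "sparse"
sparse.mkdir(parents=True, exist_ok=True)

# Standard COLMAP sparse-reconstruction pipeline: features -> matches -> mapping.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", str(db), "--image_path", str(images)], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", str(db)], check=True)
subprocess.run(["colmap", "mapper",
                "--database_path", str(db), "--image_path", str(images),
                "--output_path", str(sparse)], check=True)
# The resulting sparse model holds the per-image intrinsics and camera poses.
```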
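The "Experiment Setup" row describes a two-stage schedule: 15k iterations in total, with the shared and density volumes frozen for the first 5k while the selection volume and CLIP feature branch train at learning rates 0.02 / 1e-4, followed by finetuning of the shared volume and RGB branch at 5e-3 / 5e-5, using Adam with betas = (0.9, 0.99) and a learning-rate decay factor of 0.1. The sketch below only illustrates how those reported hyperparameters could be wired up in PyTorch; the module placeholders are hypothetical stand-ins, not the TensoRF-based model released at https://github.com/Kunhao-Liu/3D-OVS.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's TensoRF-based volumes and MLP branches.
selection_volume = nn.Parameter(torch.zeros(1, 16, 64, 64, 64))
shared_volume    = nn.Parameter(torch.zeros(1, 16, 64, 64, 64))
clip_branch      = nn.Linear(16, 512)  # stand-in for the CLIP feature branch
rgb_branch       = nn.Linear(16, 3)    # stand-in for the RGB branch

def make_optimizer(volume_params, mlp_params, lr_volume, lr_mlp):
    # Adam with betas = (0.9, 0.99), per the reported setup.
    return torch.optim.Adam(
        [{"params": volume_params, "lr": lr_volume},
         {"params": mlp_params, "lr": lr_mlp}],
        betas=(0.9, 0.99),
    )

# Stage 1 (first 5k of 15k iterations): shared/density volumes frozen;
# train the selection volume and CLIP feature branch at lr 0.02 / 1e-4.
shared_volume.requires_grad_(False)
opt_stage1 = make_optimizer([selection_volume], clip_branch.parameters(), 0.02, 1e-4)

# Stage 2 (remaining 10k iterations): additionally finetune the shared volume
# and the RGB branch at lr 5e-3 / 5e-5.
shared_volume.requires_grad_(True)
opt_stage2 = make_optimizer([shared_volume], rgb_branch.parameters(), 5e-3, 5e-5)

# The paper reports a learning-rate decay with factor 0.1; an exponential
# schedule reaching 0.1x over the 15k iterations is one plausible reading.
total_iters = 15_000
sched = torch.optim.lr_scheduler.ExponentialLR(opt_stage1, gamma=0.1 ** (1 / total_iters))
```

Ray batches of 4096 (for L_supervision and L_RDA) and 256×256 patches with batch size 8 (for L_FDA) would then be drawn inside the training loop, which is omitted here.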