Weakly Supervised 3D Open-vocabulary Segmentation
Authors: Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs." and, from Section 4 (Experiments): "We evaluate our method on 3D open-vocabulary segmentation, showing that our method can recognize long-tail classes and produce highly accurate object boundaries even with limited input data. We employ TensoRF [42] as the backbone and extract 3 scales of pixel-level CLIP features. More implementation details and experiments are in the appendix." |
| Researcher Affiliation | Academia | Kunhao Liu (1), Fangneng Zhan (2), Jiahui Zhang (1), Muyu Xu (1), Yingchen Yu (1), Abdulmotaleb El Saddik (3,5), Christian Theobalt (2), Eric Xing (4,5), Shijian Lu (1); affiliations: (1) Nanyang Technological University, (2) Max Planck Institute for Informatics, (3) University of Ottawa, (4) Carnegie Mellon University, (5) MBZUAI |
| Pseudocode | Yes | Algorithm 1: Extracting pixel-level features of an image from CLIP (a hedged sketch of this multi-scale extraction appears after the table). |
| Open Source Code | Yes | Code is available at https://github.com/Kunhao-Liu/3D-OVS. |
| Open Datasets | No | The paper states, "Thus following [2], we create a dataset comprising 10 distinct scenes," and describes how it was collected: "We capture 10 scenes using smartphones and use Colmap [75] to extract camera parameters for each image" (a sketch of this COLMAP step appears after the table). However, it does not provide any direct link, DOI, repository name, or citation that gives concrete public access information for *their created dataset*. |
| Dataset Splits | No | The paper mentions "Ground truth masks for the test views are manually annotated" and "We manually annotate the segmentation maps of 5 views for each scene as the ground truth for evaluation." However, it does not provide specific details on training/validation/test splits, such as percentages or sample counts for each split. |
| Hardware Specification | Yes | The model is trained on an NVIDIA A5000 GPU with 24G memory for 1h30min for each scene. |
| Software Dependencies | No | The paper mentions using "TensoRF [42]", the "ViT-B/16 CLIP model", the "version 1 dino_vitb8 model", and "Colmap [75]". While some models carry a version (like "version 1 dino_vitb8"), no specific software versions for frameworks (e.g., PyTorch 1.9) or exact versions of the other mentioned software/models are provided. |
| Experiment Setup | Yes | "We set τ = 0.2 to get the sharper segmentation probability distribution P. The offset b is set to 0.7 to measure the similarities of the DINO features... We use 3 scales of CLIP features, and the patch sizes of each scale are set as s/5, s/7, and s/10... The weights associated with similar and dissimilar DINO features in L_FDA are set as λ_pos = 200 and λ_neg = 0.2 by default." and "For segmentation training, we train the model for 15k iterations. In the first 5k iterations, we freeze the shared volume and density volume, and train the selection volume and the CLIP feature branch. For the rest 10k iterations, we further finetune the shared volume and the RGB branch. We use the Adam optimizer with betas = (0.9, 0.99). The learning rates for training the volume and MLP branch are respectively set to 0.02 and 1e-4. For finetuning the volume and the MLP, the learning rates are set to 5e-3 and 5e-5. We also employ a learning rate decay with a factor of 0.1." and "When computing L_supervision and L_RDA, we randomly sample rays with a batch size of 4096. When computing L_FDA we randomly sample patches of size 256×256 with a batch size of 8." (A hedged sketch of this two-stage schedule appears after the table.) |
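The "Pseudocode" row points to Algorithm 1, which extracts pixel-level CLIP features by encoding image patches at three scales (patch sizes s/5, s/7, and s/10, per the experiment-setup row). The snippet below is a minimal sketch of that idea, assuming OpenAI's `clip` package and the ViT-B/16 backbone named in the paper: it tiles the image with non-overlapping patches and broadcasts each patch embedding to the patch's pixels. It is not the authors' exact Algorithm 1; the file path and the simple tiling strategy are illustrative assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 is the CLIP backbone named in the paper.
model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def pixel_level_clip_features(image: Image.Image, patch_size: int) -> torch.Tensor:
    """Encode non-overlapping patches with CLIP and broadcast each patch
    embedding to its pixels, giving an (H, W, D) pixel-level feature map."""
    w, h = image.size
    feats = torch.zeros(h, w, model.visual.output_dim, device=device)
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            box = (left, top, min(left + patch_size, w), min(top + patch_size, h))
            patch = preprocess(image.crop(box).convert("RGB")).unsqueeze(0).to(device)
            emb = model.encode_image(patch)             # (1, D) patch embedding
            emb = emb / emb.norm(dim=-1, keepdim=True)  # CLIP features are used normalized
            feats[box[1]:box[3], box[0]:box[2]] = emb   # broadcast to all pixels of the patch
    return feats

# Three scales with patch sizes s/5, s/7, s/10 (s = shorter image side), as in the paper.
# "scene/view_000.jpg" is a hypothetical path.
img = Image.open("scene/view_000.jpg")
s = min(img.size)
multi_scale_feats = [pixel_level_clip_features(img, max(1, s // k)) for k in (5, 7, 10)]
```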
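The "Open Datasets" row quotes the data-collection step "use Colmap [75] to extract camera parameters for each image". Below is a minimal sketch of that step using COLMAP's standard command-line pipeline driven from Python; the directory layout is a hypothetical assumption and this is not the authors' preprocessing script.

```python
import subprocess
from pathlib import Path

# Hypothetical scene directory containing the smartphone captures.
scene = Path("scenes/my_scene")
db, images, sparse = scene / "database.db", scene / "images", scene / "sparse"
sparse.mkdir(parents=True, exist_ok=True)

# Standard COLMAP sparse-reconstruction pipeline: features -> matches -> mapping.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", str(db), "--image_path", str(images)], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", str(db)], check=True)
subprocess.run(["colmap", "mapper",
                "--database_path", str(db), "--image_path", str(images),
                "--output_path", str(sparse)], check=True)
# The resulting sparse model holds the per-image intrinsics and camera poses.
```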
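The "Experiment Setup" row describes a two-stage schedule: 15k iterations in total, with the shared and density volumes frozen for the first 5k while the selection volume and CLIP feature branch train at learning rates 0.02 / 1e-4, followed by finetuning of the shared volume and RGB branch at 5e-3 / 5e-5, using Adam with betas = (0.9, 0.99) and a learning-rate decay factor of 0.1. The sketch below only illustrates how those reported hyperparameters could be wired up in PyTorch; the module placeholders are hypothetical stand-ins, not the TensoRF-based model released at https://github.com/Kunhao-Liu/3D-OVS.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's TensoRF-based volumes and MLP branches.
selection_volume = nn.Parameter(torch.zeros(1, 16, 64, 64, 64))
shared_volume    = nn.Parameter(torch.zeros(1, 16, 64, 64, 64))
clip_branch      = nn.Linear(16, 512)  # stand-in for the CLIP feature branch
rgb_branch       = nn.Linear(16, 3)    # stand-in for the RGB branch

def make_optimizer(volume_params, mlp_params, lr_volume, lr_mlp):
    # Adam with betas = (0.9, 0.99), per the reported setup.
    return torch.optim.Adam(
        [{"params": volume_params, "lr": lr_volume},
         {"params": mlp_params, "lr": lr_mlp}],
        betas=(0.9, 0.99),
    )

# Stage 1 (first 5k of 15k iterations): shared/density volumes frozen;
# train the selection volume and CLIP feature branch at lr 0.02 / 1e-4.
shared_volume.requires_grad_(False)
opt_stage1 = make_optimizer([selection_volume], clip_branch.parameters(), 0.02, 1e-4)

# Stage 2 (remaining 10k iterations): additionally finetune the shared volume
# and the RGB branch at lr 5e-3 / 5e-5.
shared_volume.requires_grad_(True)
opt_stage2 = make_optimizer([shared_volume], rgb_branch.parameters(), 5e-3, 5e-5)

# The paper reports a learning-rate decay with factor 0.1; an exponential
# schedule reaching 0.1x over the 15k iterations is one plausible reading.
total_iters = 15_000
sched = torch.optim.lr_scheduler.ExponentialLR(opt_stage1, gamma=0.1 ** (1 / total_iters))
```

Ray batches of 4096 (for L_supervision and L_RDA) and 256×256 patches with batch size 8 (for L_FDA) would then be drawn inside the training loop, which is omitted here.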