Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
Authors: Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide results for quantitative comparisons in Tab. 1. We first compare with CLIP, and demonstrate remarkable gains in all benchmarks, bringing in an average of +16.2 mIoU improvement. |
| Researcher Affiliation | Collaboration | Heeseong Shin¹, Chaehyun Kim¹, Sunghwan Hong², Seokju Cho¹, Anurag Arnab³, Paul Hongsuck Seo², Seungryong Kim¹ (¹KAIST, ²Korea University, ³Google Research) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We are committed to releasing our code upon acceptance. |
| Open Datasets | Yes | We train our model on the SA-1B [13] dataset, where we randomly sample 5% of the images. We evaluate our model on zero-shot transfer to semantic segmentation on the validation sets of COCO-Stuff [50], ADE20K [51], PASCAL-Context [52], PASCAL VOC [53], and Cityscapes [54]. |
| Dataset Splits | No | We train our model on the SA-1B [13] dataset, where we randomly sample 5% of the images. We train for 10000 iterations with a batch size of 48 for all experiments. For all our experiments, we use a single text prompt `A photo of {} in the scene` for P, including for our learnable class prompts during training; for inference, we apply the prompt ensemble strategy [30] with 7 additional prompts originally curated from CLIP [15]. We evaluate our model on zero-shot transfer to semantic segmentation on the validation sets of COCO-Stuff [50], ADE20K [51], PASCAL-Context [52], PASCAL VOC [53], and Cityscapes [54]. |
| Hardware Specification | Yes | Without specification, we report results on the ConvNeXt-B [49] backbone with mask annotations from SAM, which takes approximately 6 hours to train with 4 NVIDIA A6000 GPUs. |
| Software Dependencies | No | We implement our work using PyTorch [63] and Detectron2 [64]. |
| Experiment Setup | Yes | We train for 10000 iterations with a batch size of 48 for all experiments. For training, we employ per-pixel binary cross-entropy loss as Lmask to jointly train all of the components [6]. For all our experiments, we use a single text prompt `A photo of {} in the scene` for P, including for our learnable class prompts during training; for inference, we apply the prompt ensemble strategy [30] with 7 additional prompts originally curated from CLIP [15]. ... We set γ = 0.999, input resolution as H = W = 640, which results in h = w = 20, and set h = w = 80 for ConvNeXt [49] backbones. For ViT [62] backbones, we set H = W = 320, which also results in h = w = 20. For global clustering, we set ε = 1 for ConvNeXt backbones and ε = 0.01 for ViT backbones. ... The AdamW [65] optimizer is used with a learning rate of 2 × 10⁻⁴ for the decoder, 2 × 10⁻⁵ for the prompt tokens and 2 × 10⁻⁶ for CLIP, with weight decay set to 10⁻⁴. Prompt tokens are initialized as random word tokens with l = 4, and k = 64 as default. |
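
The experiment-setup details quoted above can be gathered into a compact configuration sketch. The snippet below is a minimal, hedged reconstruction in PyTorch: the module names (`decoder`, `prompt_tokens`, `clip_backbone`), the helper `build_optimizer`, and the `TRAIN_CONFIG` dictionary are illustrative assumptions rather than the authors' released code; only the numeric values are taken from the paper.

```python
# Minimal sketch of the optimizer and hyperparameters from the "Experiment Setup" row.
# Module names (decoder, prompt_tokens, clip_backbone) are placeholders, not
# identifiers from the authors' implementation.
import torch


def build_optimizer(decoder: torch.nn.Module,
                    prompt_tokens: torch.nn.Parameter,
                    clip_backbone: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with the per-component learning rates quoted in the paper."""
    param_groups = [
        {"params": decoder.parameters(),       "lr": 2e-4},  # decoder
        {"params": [prompt_tokens],            "lr": 2e-5},  # learnable class prompts
        {"params": clip_backbone.parameters(), "lr": 2e-6},  # CLIP backbone
    ]
    return torch.optim.AdamW(param_groups, weight_decay=1e-4)


# Remaining quoted settings, collected into a plain dictionary for reference.
TRAIN_CONFIG = {
    "iterations": 10_000,
    "batch_size": 48,
    "mask_loss": "per-pixel binary cross-entropy (L_mask)",
    "gamma": 0.999,
    "input_resolution": {"convnext": 640, "vit": 320},   # H = W; h = w = 20 (80 for ConvNeXt)
    "global_clustering_epsilon": {"convnext": 1.0, "vit": 0.01},
    "text_prompt": "A photo of {} in the scene",
    "prompt_token_length_l": 4,
    "num_prompts_k": 64,
}
```

Using three parameter groups matches the quoted setup, where the decoder is trained with a much larger learning rate than the prompt tokens, and CLIP itself receives only a very small one.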