Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Authors: Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide results for quantitative comparisons in Tab. 1. We first compare with CLIP, and demonstrate remarkable gains in all benchmarks, bringing in an average of +16.2 mIoU improvement.
Researcher Affiliation | Collaboration | Heeseong Shin¹, Chaehyun Kim¹, Sunghwan Hong², Seokju Cho¹, Anurag Arnab³, Paul Hongsuck Seo², Seungryong Kim¹ (¹KAIST, ²Korea University, ³Google Research)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We are committed to releasing our code upon acceptance.
Open Datasets | Yes | We train our model on SA-1B [13] dataset, where we randomly sample 5% of the images. We evaluate our model on zero-shot transfer to semantic segmentation on the validation sets of COCO-Stuff [50], ADE-20K [51], PASCAL-Context [52], PASCAL VOC [53], and Cityscapes [54].
Dataset Splits | No | We train our model on SA-1B [13] dataset, where we randomly sample 5% of the images. We train for 10000 iterations with a batch size of 48 for all experiments. For all our experiments, we use a single text prompt A photo of {} in the scene for P, including for our learnable class prompts while training and for inference, we apply prompt ensemble strategy [30] with 7 additional prompts originally curated from CLIP [15]. We evaluate our model on zero-shot transfer to semantic segmentation on the validation sets of COCO-Stuff [50], ADE-20K [51], PASCAL-Context [52], PASCAL VOC [53], and Cityscapes [54].
Hardware Specification | Yes | Without specification, we report results on ConvNeXt-B [49] backbone with mask annotation from SAM, which takes approximately 6 hours to train with 4 NVIDIA A6000 GPUs.
Software Dependencies | No | We implement our work using PyTorch [63] and Detectron2 [64].
Experiment Setup | Yes | We train for 10000 iterations with a batch size of 48 for all experiments. For training, we employ per-pixel binary cross-entropy loss as L_mask to jointly train all of the components [6]. For all our experiments, we use a single text prompt A photo of {} in the scene for P, including for our learnable class prompts while training and for inference, we apply prompt ensemble strategy [30] with 7 additional prompts originally curated from CLIP [15]. ... We set γ = 0.999, input resolution as H = W = 640, which results in h = w = 20, and set h = w = 80 for ConvNeXt [49] backbones. For ViT [62] backbones, we set H = W = 320, which also results in h = w = 20. For global clustering, we set ε = 1 for ConvNeXt backbones and ε = 0.01 for ViT backbones. ... AdamW [65] optimizer is used with a learning rate of 2 × 10⁻⁴ for the decoder, 2 × 10⁻⁵ for the prompt tokens and 2 × 10⁻⁶ for CLIP, with weight decay set to 10⁻⁴. Prompt tokens are initialized as random word tokens with l = 4, and k = 64 as default.
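The Dataset Splits row quotes training on a random 5% sample of SA-1B images. A minimal sketch of that subsampling step, assuming the SA-1B images have already been downloaded locally; the directory path, file pattern, and seed below are illustrative and not taken from the paper:

import random
from pathlib import Path

# Hypothetical location of the downloaded SA-1B images (not specified in the quoted text).
SA1B_ROOT = Path("/data/sa-1b/images")

def sample_sa1b_subset(root: Path, fraction: float = 0.05, seed: int = 0) -> list[Path]:
    """Randomly keep `fraction` of the SA-1B images (the paper reports sampling 5%)."""
    all_images = sorted(root.glob("*.jpg"))
    rng = random.Random(seed)
    n_keep = int(len(all_images) * fraction)
    return rng.sample(all_images, n_keep)

train_images = sample_sa1b_subset(SA1B_ROOT)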
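The Dataset Splits and Experiment Setup rows both mention a single training prompt, A photo of {} in the scene, and a prompt-ensemble strategy at inference. The sketch below shows generic CLIP-style prompt ensembling with OpenAI's clip package; the extra templates are placeholders (the 7 CLIP-curated prompts are not listed in the quoted text) and the backbone name is illustrative rather than the one used in the paper:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)  # illustrative backbone choice

TRAIN_TEMPLATE = "A photo of {} in the scene"
# Placeholder ensemble templates; the paper adds 7 prompts originally curated from CLIP.
EXTRA_TEMPLATES = ["a photo of a {}.", "a photo of the {}.", "a cropped photo of a {}."]

@torch.no_grad()
def class_embedding(class_name: str) -> torch.Tensor:
    """Average the L2-normalized CLIP text embeddings over all prompt templates."""
    prompts = [TRAIN_TEMPLATE.format(class_name)]
    prompts += [t.format(class_name) for t in EXTRA_TEMPLATES]
    tokens = clip.tokenize(prompts).to(device)
    feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)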
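The Experiment Setup row also gives concrete optimization details: AdamW with learning rates of 2 × 10⁻⁴ for the decoder, 2 × 10⁻⁵ for the prompt tokens, and 2 × 10⁻⁶ for CLIP, weight decay 10⁻⁴, and a per-pixel binary cross-entropy mask loss. A sketch of how these pieces could be wired up in PyTorch; the module definitions are stand-ins for the components named in the quote, not the authors' architecture:

import torch
import torch.nn.functional as F

# Placeholder modules standing in for the components named in the quoted setup.
decoder = torch.nn.Conv2d(512, 256, kernel_size=1)
prompt_tokens = torch.nn.Parameter(torch.randn(64, 4, 512))  # stand-in for k = 64 prompts of length l = 4
clip_model = torch.nn.Linear(512, 512)

# Separate learning rates per component, shared weight decay, as reported in the table.
optimizer = torch.optim.AdamW(
    [
        {"params": decoder.parameters(), "lr": 2e-4},
        {"params": [prompt_tokens], "lr": 2e-5},
        {"params": clip_model.parameters(), "lr": 2e-6},
    ],
    weight_decay=1e-4,
)

def mask_loss(pred_logits: torch.Tensor, target_masks: torch.Tensor) -> torch.Tensor:
    """Per-pixel binary cross-entropy between predicted mask logits and binary targets."""
    return F.binary_cross_entropy_with_logits(pred_logits, target_masks.float())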