Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
Authors: Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide results for quantitative comparisons in Tab. 1. We first compare with CLIP, and demonstrate remarkable gains in all benchmarks, bringing in an average of +16.2 mIoU improvement. |
| Researcher Affiliation | Collaboration | Heeseong Shin¹, Chaehyun Kim¹, Sunghwan Hong², Seokju Cho¹, Anurag Arnab³, Paul Hongsuck Seo², Seungryong Kim¹; ¹KAIST, ²Korea University, ³Google Research |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We are committed to releasing our code upon acceptance. |
| Open Datasets | Yes | We train our model on SA-1B [13] dataset, where we randomly sample 5% of the images. We evaluate our model on zero-shot transfer to semantic segmentation on the validation sets of COCO-Stuff [50], ADE-20K [51], PASCAL-Context [52], PASCAL VOC [53], and Cityscapes [54]. |
| Dataset Splits | No | We train our model on SA-1B [13] dataset, where we randomly sample 5% of the images. We train for 10000 iterations with a batch size of 48 for all experiments. For all our experiments, we use a single text prompt "A photo of {} in the scene" for P, including for our learnable class prompts while training and for inference, we apply prompt ensemble strategy [30] with 7 additional prompts originally curated from CLIP [15]. We evaluate our model on zero-shot transfer to semantic segmentation on the validation sets of COCO-Stuff [50], ADE-20K [51], PASCAL-Context [52], PASCAL VOC [53], and Cityscapes [54]. |
| Hardware Specification | Yes | Without specification, we report results on ConvNeXt-B [49] backbone with mask annotation from SAM, which takes approximately 6 hours to train with 4 NVIDIA A6000 GPUs. |
| Software Dependencies | No | We implement our work using PyTorch [63] and Detectron2 [64]. |
| Experiment Setup | Yes | We train for 10000 iterations with a batch size of 48 for all experiments. For training, we employ per-pixel binary cross-entropy loss as L_mask to jointly train all of the components [6]. For all our experiments, we use a single text prompt "A photo of {} in the scene" for P, including for our learnable class prompts while training and for inference, we apply prompt ensemble strategy [30] with 7 additional prompts originally curated from CLIP [15]. ... We set γ = 0.999, input resolution as H = W = 640, which results in h = w = 20, and set h = w = 80 for ConvNeXt [49] backbones. For ViT [62] backbones, we set H = W = 320, which also results in h = w = 20. For global clustering, we set ε = 1 for ConvNeXt backbones and ε = 0.01 for ViT backbones. ... AdamW [65] optimizer is used with a learning rate of 2 × 10⁻⁴ for the decoder, 2 × 10⁻⁵ for the prompt tokens and 2 × 10⁻⁶ for CLIP, with weight decay set to 10⁻⁴. Prompt tokens are initialized as random word tokens with l = 4, and k = 64 as default. |
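
The settings quoted in the Experiment Setup row can be collected into a small configuration sketch. The Python below is illustrative only and is not the authors' code, which the table reports as unreleased. The names `TrainConfig`, `samples_seen`, `implied_stride`, and `build_prompt` are hypothetical, and reading the ratio H / h as a downsampling factor is an assumption drawn from the quoted resolutions rather than something the excerpt states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    iterations: int = 10_000
    batch_size: int = 48
    gamma: float = 0.999            # quoted as "γ = 0.999"
    lr_decoder: float = 2e-4        # AdamW learning rate for the decoder
    lr_prompt_tokens: float = 2e-5  # AdamW learning rate for the prompt tokens
    lr_clip: float = 2e-6           # AdamW learning rate for CLIP
    weight_decay: float = 1e-4
    prompt_length: int = 4          # quoted as "l = 4"
    k: int = 64                     # quoted as "k = 64"
    prompt_template: str = "A photo of {} in the scene"


def samples_seen(cfg: TrainConfig) -> int:
    """Total training samples drawn: iterations times batch size."""
    return cfg.iterations * cfg.batch_size


def implied_stride(input_res: int, feat_res: int) -> int:
    """Downsampling factor implied by an input size H and a feature size h (assumed H / h)."""
    if input_res % feat_res != 0:
        raise ValueError("input resolution is not a multiple of the feature resolution")
    return input_res // feat_res


def build_prompt(cfg: TrainConfig, class_name: str) -> str:
    """Fill the quoted text prompt with a class name."""
    return cfg.prompt_template.format(class_name)


cfg = TrainConfig()
print(samples_seen(cfg))         # 480000
print(implied_stride(640, 20))   # 32 (H = W = 640, h = w = 20)
print(implied_stride(320, 20))   # 16 (H = W = 320, h = w = 20)
print(build_prompt(cfg, "dog"))  # A photo of dog in the scene
```

Keeping the three learning rates as separate fields mirrors the quoted setup, in which the decoder, the prompt tokens, and CLIP are each optimized at a different rate.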