SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Authors: Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, Tianrui Li

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results show that our model achieves comparable or superior segmentation accuracy on the PASCAL VOC 2012 (+0.3% mIoU), PASCAL Context (+2.3% mIoU), and COCO (+2.2% mIoU) compared with baselines." |
| Researcher Affiliation | Collaboration | ¹JD AI Research; ²Southwest Jiaotong University, Chengdu, China. |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "We release the code at https://github.com/ArrowLuo/SegCLIP." |
| Open Datasets | Yes | "We pretrain the SegCLIP on the training splits of Conceptual Captions (CC) (Sharma et al., 2018) and COCO (Lin et al., 2014)." |
| Dataset Splits | Yes | "We pretrain the SegCLIP on the training splits of Conceptual Captions (CC) (Sharma et al., 2018) and COCO (Lin et al., 2014)... For the semantic segmentation, we evaluate the model on the validation splits of the PASCAL VOC 2012 (Everingham et al., 2010), PASCAL Context (Mottaghi et al., 2014), and COCO datasets." |
| Hardware Specification | Yes | "We pretrain our model using 8 NVIDIA A100 GPUs with a batch size of 768 for 10 epochs." |
| Software Dependencies | No | The paper mentions several components and algorithms (e.g., ViT, CLIP, the Adam optimizer, Gumbel-Softmax, GELU) but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | "The image size is set to 224×224, and the patch size is 16×16. The max length of the text tokens is 32. ... For the semantic group module, we put it after the 10th Transformer layer in the image encoder via grid search based on segmentation datasets. The cross-attention layer number is set to 2. The decoder layer of MAE is 3, and the mask rate of patches is 0.75. The number of learnable centers is 8. ... For the optimization, we use Adam optimizer and a cosine schedule of learning rate following the CLIP. The initial learning rate is 4e-6 for the embedding layers, text encoder, and Transformer layers of the image encoder before the semantic group module. For the rest of the parameters, the initial learning rate is 4e-3. We pretrain our model using 8 NVIDIA A100 GPUs with a batch size of 768 for 10 epochs." |
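The Experiment Setup row names the paper's core component: a semantic group module with 8 learnable centers and 2 cross-attention layers, inserted after the 10th Transformer layer, with Gumbel-Softmax mentioned elsewhere in the paper. The sketch below illustrates the general mechanism those numbers imply (learnable centers attending over patch tokens, then hard patch-to-center assignment); it is not the authors' released implementation, and the embedding dimension, head count, and assignment head are assumptions.

```python
# Minimal sketch of a semantic-group-style module: K learnable centers gather
# patch tokens via cross-attention, then each patch is hard-assigned to one
# center with Gumbel-Softmax. Illustrative only; dims/heads are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGroupModule(nn.Module):
    def __init__(self, dim=768, num_centers=8, num_layers=2, num_heads=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)  # 2 cross-attention layers per the paper
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):  # patch_tokens: (B, N, D)
        B = patch_tokens.size(0)
        centers = self.centers.unsqueeze(0).expand(B, -1, -1)      # (B, K, D)
        # Centers query the patch tokens.
        for attn in self.attn_layers:
            out, _ = attn(centers, patch_tokens, patch_tokens)
            centers = self.norm(centers + out)
        # Hard patch-to-center assignment, differentiable via Gumbel-Softmax.
        logits = patch_tokens @ centers.transpose(1, 2)            # (B, N, K)
        assign = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
        # Aggregate patches into K segment-level embeddings.
        segments = assign.transpose(1, 2) @ patch_tokens           # (B, K, D)
        return segments, assign

# Example: 196 patch tokens, as from a 224×224 image with 16×16 patches.
tokens = torch.randn(2, 196, 768)
segments, assign = SemanticGroupModule()(tokens)
print(segments.shape, assign.shape)  # (2, 8, 768), (2, 196, 8)
```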
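The same row also quotes a two-group optimization recipe: lr 4e-6 for the pretrained CLIP parts (embeddings, text encoder, image-encoder layers before the group module) and lr 4e-3 for everything else, with Adam and a cosine schedule. Below is a minimal PyTorch sketch of how that could be wired up; the module names, betas/eps, and warmup length are assumptions, while the learning rates, optimizer, schedule type, batch size, and epoch count come from the quoted text.

```python
# Sketch of the quoted two-LR optimization setup (not the authors' code).
import math
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

GROUP_LAYER = 10   # group module sits after the 10th Transformer layer
BASE_LR, HEAD_LR = 4e-6, 4e-3
EPOCHS, GLOBAL_BATCH = 10, 768   # 768 across 8 A100s -> 96 per GPU

def param_groups(model: nn.Module):
    """Pretrained CLIP weights get the low LR; new parameters get the high LR."""
    low, high = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Hypothetical naming convention: text encoder, embeddings, and
        # image-encoder blocks 0..9 count as "pretrained".
        pretrained = (
            name.startswith("text_encoder")
            or "embed" in name
            or any(name.startswith(f"image_encoder.blocks.{i}.")
                   for i in range(GROUP_LAYER))
        )
        (low if pretrained else high).append(p)
    return [{"params": low, "lr": BASE_LR}, {"params": high, "lr": HEAD_LR}]

def cosine_with_warmup(step, total_steps, warmup=500):
    """Linear warmup then cosine decay to zero (warmup length assumed)."""
    if step < warmup:
        return step / max(1, warmup)
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * t))

# Toy model just to exercise the grouping logic; real code would pass SegCLIP.
toy = nn.ModuleDict({
    "text_encoder": nn.Linear(8, 8),
    "image_encoder": nn.ModuleDict(
        {"blocks": nn.ModuleList(nn.Linear(8, 8) for _ in range(12))}),
    "mae_decoder": nn.Linear(8, 8),
})
opt = Adam(param_groups(toy), betas=(0.9, 0.98), eps=1e-6)  # betas/eps assumed
sched = LambdaLR(opt, lambda s: cosine_with_warmup(s, total_steps=10_000))
```

Note that `LambdaLR` scales each parameter group's own base LR by the same cosine factor, so the 4e-6/4e-3 split is preserved throughout the schedule.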