SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Authors: Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, Tianrui Li

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results show that our model achieves comparable or superior segmentation accuracy on the PASCAL VOC 2012 (+0.3% mIoU), PASCAL Context (+2.3% mIoU), and COCO (+2.2% mIoU) compared with baselines." |
| Researcher Affiliation | Collaboration | ¹JD AI Research; ²Southwest Jiaotong University, Chengdu, China. |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "We release the code at https://github.com/ArrowLuo/SegCLIP." |
| Open Datasets | Yes | "We pretrain the SegCLIP on the training splits of Conceptual Captions (CC) (Sharma et al., 2018) and COCO (Lin et al., 2014)." |
| Dataset Splits | Yes | "We pretrain the SegCLIP on the training splits of Conceptual Captions (CC) (Sharma et al., 2018) and COCO (Lin et al., 2014)... For the semantic segmentation, we evaluate the model on the validation splits of the PASCAL VOC 2012 (Everingham et al., 2010), PASCAL Context (Mottaghi et al., 2014), and COCO datasets." |
| Hardware Specification | Yes | "We pretrain our model using 8 NVIDIA A100 GPUs with a batch size of 768 for 10 epochs." |
| Software Dependencies | No | The paper mentions several components and algorithms (e.g., ViT, CLIP, the Adam optimizer, Gumbel-Softmax, GELU) but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | "The image size is set to 224×224, and the patch size is 16×16. The max length of the text tokens is 32. ... For the semantic group module, we put it after the 10th Transformer layer in the image encoder via grid search based on segmentation datasets. The cross-attention layer number is set to 2. The decoder layer of MAE is 3, and the mask rate of patches is 0.75. The number of learnable centers is 8. ... For the optimization, we use Adam optimizer and a cosine schedule of learning rate following the CLIP. The initial learning rate is 4e-6 for the embedding layers, text encoder, and Transformer layers of the image encoder before the semantic group module. For the rest of the parameters, the initial learning rate is 4e-3. We pretrain our model using 8 NVIDIA A100 GPUs with a batch size of 768 for 10 epochs." |
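The Experiment Setup row names the paper's core component: a semantic group module with 8 learnable centers and 2 cross-attention layers, inserted after the 10th Transformer layer, with Gumbel-Softmax mentioned elsewhere in the paper. The sketch below illustrates the general mechanism those numbers imply (learnable centers attending over patch tokens, then hard patch-to-center assignment); it is not the authors' released implementation, and the embedding dimension, head count, and assignment head are assumptions.

```python
# Minimal sketch of a semantic-group-style module: K learnable centers gather
# patch tokens via cross-attention, then each patch is hard-assigned to one
# center with Gumbel-Softmax. Illustrative only; dims/heads are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGroupModule(nn.Module):
    def __init__(self, dim=768, num_centers=8, num_layers=2, num_heads=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)  # 2 cross-attention layers per the paper
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):  # patch_tokens: (B, N, D)
        B = patch_tokens.size(0)
        centers = self.centers.unsqueeze(0).expand(B, -1, -1)      # (B, K, D)
        # Centers query the patch tokens.
        for attn in self.attn_layers:
            out, _ = attn(centers, patch_tokens, patch_tokens)
            centers = self.norm(centers + out)
        # Hard patch-to-center assignment, differentiable via Gumbel-Softmax.
        logits = patch_tokens @ centers.transpose(1, 2)            # (B, N, K)
        assign = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
        # Aggregate patches into K segment-level embeddings.
        segments = assign.transpose(1, 2) @ patch_tokens           # (B, K, D)
        return segments, assign

# Example: 196 patch tokens, as from a 224×224 image with 16×16 patches.
tokens = torch.randn(2, 196, 768)
segments, assign = SemanticGroupModule()(tokens)
print(segments.shape, assign.shape)  # (2, 8, 768), (2, 196, 8)
```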
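The same row also quotes a two-group optimization recipe: lr 4e-6 for the pretrained CLIP parts (embeddings, text encoder, image-encoder layers before the group module) and lr 4e-3 for everything else, with Adam and a cosine schedule. Below is a minimal PyTorch sketch of how that could be wired up; the module names, betas/eps, and warmup length are assumptions, while the learning rates, optimizer, schedule type, batch size, and epoch count come from the quoted text.

```python
# Sketch of the quoted two-LR optimization setup (not the authors' code).
import math
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

GROUP_LAYER = 10   # group module sits after the 10th Transformer layer
BASE_LR, HEAD_LR = 4e-6, 4e-3
EPOCHS, GLOBAL_BATCH = 10, 768   # 768 across 8 A100s -> 96 per GPU

def param_groups(model: nn.Module):
    """Pretrained CLIP weights get the low LR; new parameters get the high LR."""
    low, high = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Hypothetical naming convention: text encoder, embeddings, and
        # image-encoder blocks 0..9 count as "pretrained".
        pretrained = (
            name.startswith("text_encoder")
            or "embed" in name
            or any(name.startswith(f"image_encoder.blocks.{i}.")
                   for i in range(GROUP_LAYER))
        )
        (low if pretrained else high).append(p)
    return [{"params": low, "lr": BASE_LR}, {"params": high, "lr": HEAD_LR}]

def cosine_with_warmup(step, total_steps, warmup=500):
    """Linear warmup then cosine decay to zero (warmup length assumed)."""
    if step < warmup:
        return step / max(1, warmup)
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * t))

# Toy model just to exercise the grouping logic; real code would pass SegCLIP.
toy = nn.ModuleDict({
    "text_encoder": nn.Linear(8, 8),
    "image_encoder": nn.ModuleDict(
        {"blocks": nn.ModuleList(nn.Linear(8, 8) for _ in range(12))}),
    "mae_decoder": nn.Linear(8, 8),
})
opt = Adam(param_groups(toy), betas=(0.9, 0.98), eps=1e-6)  # betas/eps assumed
sched = LambdaLR(opt, lambda s: cosine_with_warmup(s, total_steps=10_000))
```

Note that `LambdaLR` scales each parameter group's own base LR by the same cosine factor, so the 4e-6/4e-3 split is preserved throughout the schedule.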