Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation
Authors: Yunheng Li, Zhong-Yu Li, Quan-Sheng Zeng, Qibin Hou, Ming-Ming Cheng
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at https://github.com/HVision-NKU/Cascade-CLIP. ... 4. Experiments; 4.1. Datasets and Evaluation Metrics; 4.2. Implementation Details; 4.3. Comparisons with the State-of-the-art Methods; 4.4. Ablation Study; 4.5. Extending Cascade-CLIP to Other Methods |
| Researcher Affiliation | Academia | ¹VCIP, School of Computer Science, Nankai University; ²Nankai International Advanced Research Institute (Shenzhen Futian). Correspondence to: Qibin Hou <houqb@nankai.edu.cn>. |
| Pseudocode | No | The paper describes the proposed framework and components using textual descriptions and diagrams (e.g., Figure 2, Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/HVision-NKU/Cascade-CLIP. |
| Open Datasets | Yes | To evaluate the effectiveness of our proposed method, we perform extensive experiments on three widely used benchmark datasets, including COCO-Stuff (Caesar et al., 2018), Pascal-VOC (Everingham et al., 2015), and Pascal Context (Mottaghi et al., 2014). ... COCO-Stuff is an extensive semantic segmentation dataset comprising 171 categories... It contains 117k training images and 5k validation images... PASCAL VOC consists of 11,185 training images and 1,449 validation images... PASCAL Context provides supplementary annotations for PASCAL VOC 2010, consisting of 4,998 training images and 5,005 validation images. |
| Dataset Splits | Yes | COCO-Stuff is an extensive semantic segmentation dataset comprising 171 categories... It contains 117k training images and 5k validation images and it is divided into 156 seen classes and 15 unseen classes. ... PASCAL VOC consists of 11,185 training images and 1,449 validation images across 20 classes. ... PASCAL Context provides supplementary annotations for PASCAL VOC 2010, consisting of 4,998 training images and 5,005 validation images. |
| Hardware Specification | Yes | We implement the proposed method on the open-source toolbox MMSegmentation (Contributors, 2020) and conduct all experiments using a machine with 4 NVIDIA RTX 3090 GPUs. ... All models are evaluated on a single 3090 GPU. |
| Software Dependencies | No | We implement the proposed method on the open-source toolbox MMSegmentation (Contributors, 2020)... While MMSegmentation is mentioned, no specific version number for it or other core software dependencies (like Python, PyTorch, or CUDA) is provided. |
| Experiment Setup | Yes | The batch size on each GPU is set to 4, and the input image resolution is 512 × 512. The optimizer is AdamW (Loshchilov & Hutter, 2019) with the default training schedule in the MMSeg toolbox. For a fair comparison, we use the same number of training iterations on each dataset as ZegCLIP (Zhou et al., 2023). ... The objective loss function $\mathcal{L}_{\text{pixel}}$ is defined as: $\mathcal{L}_{\text{pixel}} = \alpha \mathcal{L}_{\text{dice}}(Y, M) + \beta \mathcal{L}_{\text{focal}}(Y, M)$... $\{\alpha, \beta\}$ are two weights with the default values of {1, 100}, respectively. |
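
The pixel loss quoted in the Experiment Setup row is a weighted sum of a Dice term and a focal term, with weights {1, 100}. The sketch below illustrates that weighting in plain Python and is not taken from the authors' repository. The function names, the soft-Dice form, the focal parameters `gamma` and `focal_alpha`, and the convention that the first argument holds predicted probabilities and the second holds binary ground truth are all assumptions made for illustration; the quoted text does not specify them.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened mask probabilities in [0, 1] (assumed form)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def focal_loss(probs, targets, gamma=2.0, focal_alpha=0.25, eps=1e-12):
    """Mean binary focal loss; gamma and focal_alpha are assumed values."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p                        # probability of the true class
        a_t = focal_alpha if t == 1 else 1.0 - focal_alpha    # class-balancing factor
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)


def pixel_loss(probs, targets, alpha=1.0, beta=100.0):
    """Weighted sum alpha * dice + beta * focal, using the quoted default weights {1, 100}."""
    return alpha * dice_loss(probs, targets) + beta * focal_loss(probs, targets)


if __name__ == "__main__":
    ground_truth = [1, 0, 1, 0]
    confident = [0.9, 0.1, 0.8, 0.2]   # close to the ground truth
    uncertain = [0.5, 0.5, 0.5, 0.5]   # uninformative prediction
    print("confident:", pixel_loss(confident, ground_truth))
    print("uncertain:", pixel_loss(uncertain, ground_truth))
```

With the large focal weight, the per-pixel classification term is scaled up by a factor of 100 relative to the region-overlap Dice term; how the two terms balance in practice depends on the focal parameters actually used, which the quoted passage leaves unstated.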