Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen

NeurIPS 2023

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the supporting LLM response quoted from the paper.
Research Type: Experimental. Surprisingly, FC-CLIP advances state-of-the-art results on various benchmarks, while running practically fast. Specifically, when training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art under the same setting by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, the training and testing times of FC-CLIP are 7.5× and 6.6× faster than the same prior art, while using 5.9× fewer total model parameters. Meanwhile, FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code and models are available at https://github.com/bytedance/fc-clip.
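As a quick sanity check on the quoted improvements, the prior-art scores implied by the reported results and gains can be back-derived with simple arithmetic. This is a minimal sketch using only numbers from the quote above; it is not data taken from the paper's tables:

```python
# Back-derive the prior-art scores implied by the quoted FC-CLIP results and gains.
# All values come from the quoted abstract; metric names are shorthand labels.
fc_clip = {"ADE20K PQ": 26.8, "ADE20K AP": 16.8, "ADE20K mIoU": 34.1,
           "Mapillary PQ": 18.2, "Cityscapes PQ": 44.0}
gain    = {"ADE20K PQ": 4.2,  "ADE20K AP": 2.4,  "ADE20K mIoU": 4.2,
           "Mapillary PQ": 4.0, "Cityscapes PQ": 20.1}

for metric, score in fc_clip.items():
    print(f"{metric}: FC-CLIP {score}, implied prior art {score - gain[metric]:.1f}")
```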
Researcher Affiliation: Collaboration. Qihang Yu¹, Ju He², Xueqing Deng¹, Xiaohui Shen¹, Liang-Chieh Chen¹ (¹ ByteDance, ² The Johns Hopkins University)
Pseudocode: No. The paper describes the architecture and processes in text and diagrams (e.g., Figure 3), but does not provide structured pseudocode or algorithm blocks.
Open Source Code: Yes. Code and models are available at https://github.com/bytedance/fc-clip.
Open Datasets: Yes. We train FC-CLIP on COCO data with panoptic annotation [54]. We follow the 2017 splits, which include 118k images for the train split and 5k images for the val split. If not specified, we train our model on the COCO train split and report results on the val sets of various datasets. License: Creative Commons Attribution 4.0 License. URL: https://cocodataset.org/#home
Dataset Splits: Yes. We follow the 2017 splits, which include 118k images for the train split and 5k images for the val split.
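A minimal sketch, assuming standard COCO 2017 panoptic annotation files, of how the quoted split sizes could be verified. The file paths are hypothetical placeholders:

```python
# Count images in COCO 2017 panoptic annotation files to confirm the quoted
# split sizes (~118k train / 5k val). Paths are hypothetical placeholders.
import json

def count_images(panoptic_json_path: str) -> int:
    with open(panoptic_json_path) as f:
        return len(json.load(f)["images"])

print(count_images("annotations/panoptic_train2017.json"))  # expected ~118k
print(count_images("annotations/panoptic_val2017.json"))    # expected 5k
```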
Hardware Specification: Yes. Furthermore, our model training only takes 25.6 V100 GPU days, which is 7.5× faster compared to ODISE's 192 V100 GPU days. All results are obtained with one V100 GPU, CUDA 11.6, and PyTorch 1.13, by taking the average runtime on the entire validation set, including post-processing time.
Software Dependencies: Yes. All results are obtained with one V100 GPU, CUDA 11.6, and PyTorch 1.13, by taking the average runtime on the entire validation set, including post-processing time.
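A minimal sketch of how the quoted runtime measurement could be reproduced: averaging per-image inference time over the full validation set, with post-processing included in the timed region. The model, dataloader, and post-processing function are hypothetical placeholders, not code from the FC-CLIP repository:

```python
import time
import torch

@torch.no_grad()
def average_runtime(model, val_loader, post_process):
    """Average per-image runtime in seconds, including post-processing."""
    model.eval().cuda()
    total, n = 0.0, 0
    for batch in val_loader:                      # hypothetical dataloader yielding dicts
        image = batch["image"].cuda(non_blocking=True)
        torch.cuda.synchronize()
        start = time.perf_counter()
        outputs = model(image)                    # hypothetical segmentation model
        post_process(outputs)                     # post-processing counted in the timing
        torch.cuda.synchronize()
        total += time.perf_counter() - start
        n += 1
    return total / max(n, 1)
```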
Experiment Setup: Yes. Training Strategy: We follow [20] and adopt the same training recipe and losses without any special design. The training is optimized with the AdamW [41, 61] optimizer and weight decay 0.05. We use a crop size of 1024×1024. We employ a learning rate of 1×10⁻⁴ and a multi-step decay schedule. The training batch size is 16, and the model is trained for 50 epochs on the COCO panoptic training set [54].
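A minimal PyTorch sketch of the quoted training hyperparameters (AdamW, weight decay 0.05, learning rate 1e-4, multi-step decay, batch size 16, 50 epochs). The model and dataset builders and the decay milestones are hypothetical assumptions, not the authors' actual training code:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader

model = build_fc_clip()                         # hypothetical model builder
train_set = build_coco_panoptic(crop=1024)      # hypothetical dataset with 1024x1024 crops
loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=8)

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = MultiStepLR(optimizer, milestones=[44, 48], gamma=0.1)  # assumed milestones

for epoch in range(50):
    for batch in loader:
        loss = model(batch)                     # assumed to return the summed training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```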