Open-Vocabulary Universal Image Segmentation with MaskCLIP
Authors: Zheng Ding, Jieke Wang, Zhuowen Tu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this part, we train our proposed MaskCLIP method using COCO (Lin et al., 2014) training data and test on other datasets (ADE20K (Zhou et al., 2019; 2017), PASCAL Context (Mottaghi et al., 2014), LVIS) under the open-vocabulary setting. We report our results on semantic/instance/panoptic segmentation tasks to evaluate the performance of our model's universal segmentation. |
| Researcher Affiliation | Academia | University of California San Diego, La Jolla, CA 92093, USA. Correspondence to: Zheng Ding <zhding@ucsd.edu>, Jieke Wang <jiw010@ucsd.edu>, Zhuowen Tu <ztu@ucsd.edu>. |
| Pseudocode | Yes | A. CLIP Baseline Details: Here we provide more details on our CLIP Baseline. ... A formal algorithm is described as Algorithm 1 and a visualization of this is shown as Figure 6. Algorithm 1 CLIP Baseline. Require: mask proposal network f_m, CLIP visual encoder f_v, CLIP text encoder f_t. Given an image I ∈ R^{H×W×3} and a list T containing C category names. E = f_t(T). M = f_m(I). for i = 1, 2, ..., N do: R_i = M_i ⊙ I; V_i = f_v(R_i); Y_i = softmax(E · V_i); end for. (A PyTorch sketch of this loop follows the table.) |
| Open Source Code | Yes | Project website: https://maskclip.github.io. |
| Open Datasets | Yes | In this part, we train our proposed MaskCLIP method using COCO (Lin et al., 2014) training data and test on other datasets (ADE20K (Zhou et al., 2019; 2017), PASCAL Context (Mottaghi et al., 2014), LVIS) under the open-vocabulary setting. |
| Dataset Splits | Yes | COCO: COCO (Lin et al., 2014) includes 133 classes, where 80 classes are things and 53 classes are stuff or background. There are 118k training images and 5k validation images. (...) ADE20K: ADE20K (Zhou et al., 2019; 2017) contains 20,210 images and annotations for training and 2,000 images and annotations for validation. (...) PASCAL Context: PASCAL Context (Mottaghi et al., 2014) contains 10,103 per-pixel annotations for images of PASCAL VOC 2010 (Everingham et al.), where 4,998 are for training and 5,105 for validation. (...) LVIS: LVIS (Gupta et al., 2019) contains 100,170 images for training and 19,809 images for validation. |
| Hardware Specification | Yes | The training takes around 3h on 8 Nvidia A5000 GPUs. (...) For feature extraction of 100 masks in a single image, the CLIP baseline takes ~3s on a single 3090 GPU while the MaskCLIP w/o RMA baseline only takes ~0.6s, which is ~5x faster. |
| Software Dependencies | No | The paper mentions using 'AdamW (Loshchilov & Hutter, 2019) as our optimizer' but does not provide version numbers for any software dependencies such as deep learning frameworks, programming languages, or other libraries. |
| Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) as our optimizer and the learning rate is set to 0.0001. We train our model on COCO training data for 10k iterations with a batch size of 8. (...) The loss function is L = λ_ce·L_ce + λ_dice·L_dice + λ_bce·L_bce, where L_ce is the loss for classification, and L_dice and L_bce are the losses for mask localization. In our experiments, we set λ_ce = 2, λ_dice = 5, λ_bce = 5. (A sketch of this weighted loss follows the table.) |
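
The Algorithm 1 pseudocode quoted in the table maps directly onto a short per-mask loop: encode the category names once, mask out each proposal region, embed it with the CLIP visual encoder, and classify it against the text embeddings. Below is a minimal PyTorch sketch of that loop; it assumes OpenAI's `clip` package, and `mask_proposal_net` and the ViT-B/32 backbone are illustrative stand-ins rather than the authors' released code.

```python
# Minimal sketch of Algorithm 1 (CLIP Baseline). Assumes OpenAI's `clip`
# package; `mask_proposal_net` and the ViT-B/32 backbone are hypothetical
# stand-ins for the paper's f_m and CLIP encoder choice.
import torch
import clip
from torchvision.transforms.functional import to_pil_image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_baseline(image, category_names, mask_proposal_net):
    """image: (H, W, 3) float tensor in [0, 1]. Returns (N, C) class scores."""
    with torch.no_grad():
        # E = f_t(T): encode the C category names once.
        tokens = clip.tokenize(category_names).to(device)
        E = model.encode_text(tokens).float()              # (C, D)
        E = E / E.norm(dim=-1, keepdim=True)

        # M = f_m(I): N binary mask proposals of shape (N, H, W).
        masks = mask_proposal_net(image)

        scores = []
        for i in range(masks.shape[0]):
            # R_i = M_i ⊙ I: keep only the pixels inside the i-th mask.
            region = image * masks[i].unsqueeze(-1)        # (H, W, 3)
            region = preprocess(to_pil_image(region.permute(2, 0, 1)))
            # V_i = f_v(R_i): CLIP visual embedding of the masked region.
            V_i = model.encode_image(region.unsqueeze(0).to(device)).float()
            V_i = V_i / V_i.norm(dim=-1, keepdim=True)
            # Y_i = softmax(E V_i): similarity to every category name.
            scores.append((E @ V_i.squeeze(0)).softmax(dim=0))
    return torch.stack(scores)                             # (N, C)
```

Running one CLIP forward pass per mask is exactly why the baseline is slow (~3s for 100 masks, per the Hardware Specification row); the paper's MaskCLIP avoids this by sharing the backbone computation across masks.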
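
Similarly, the training loss quoted in the Experiment Setup row, L = λ_ce·L_ce + λ_dice·L_dice + λ_bce·L_bce with λ_ce = 2 and λ_dice = λ_bce = 5, combines a classification term with two mask-localization terms. The sketch below shows one plausible reading using standard cross-entropy, Dice, and binary cross-entropy losses; the `dice_loss` helper is a common formulation, not necessarily the authors' exact implementation.

```python
# Sketch of the weighted loss L = λ_ce L_ce + λ_dice L_dice + λ_bce L_bce,
# assuming per-pixel mask logits and binary mask targets.
import torch
import torch.nn.functional as F

def dice_loss(mask_logits, mask_targets, eps=1.0):
    """Soft Dice loss over flattened per-mask predictions (common formulation)."""
    probs = mask_logits.sigmoid().flatten(1)               # (N, H*W)
    targets = mask_targets.flatten(1)                      # (N, H*W)
    num = 2 * (probs * targets).sum(-1)
    den = probs.sum(-1) + targets.sum(-1)
    return (1 - (num + eps) / (den + eps)).mean()

def total_loss(class_logits, class_targets, mask_logits, mask_targets,
               lambda_ce=2.0, lambda_dice=5.0, lambda_bce=5.0):
    """Combine the three terms with the weights reported in the paper."""
    l_ce = F.cross_entropy(class_logits, class_targets)    # classification
    l_dice = dice_loss(mask_logits, mask_targets)          # mask localization
    l_bce = F.binary_cross_entropy_with_logits(
        mask_logits, mask_targets)                         # mask localization
    return lambda_ce * l_ce + lambda_dice * l_dice + lambda_bce * l_bce
```

Note that the Dice and BCE terms operate on the same per-pixel mask logits, so the 5:5 weighting balances the two localization signals against the weight-2 classification term.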