Open-Vocabulary Universal Image Segmentation with MaskCLIP
Authors: Zheng Ding, Jieke Wang, Zhuowen Tu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this part, we train our proposed MaskCLIP method using COCO (Lin et al., 2014) training data and test on other datasets (ADE20K (Zhou et al., 2019; 2017), PASCAL Context (Mottaghi et al., 2014), LVIS) under the open-vocabulary setting. We report our results on semantic/instance/panoptic segmentation tasks to evaluate the performance of our model's universal segmentation. |
| Researcher Affiliation | Academia | University of California San Diego, La Jolla, CA 92093, USA. Correspondence to: Zheng Ding <zhding@ucsd.edu>, Jieke Wang <jiw010@ucsd.edu>, Zhuowen Tu <ztu@ucsd.edu>. |
| Pseudocode | Yes | A. CLIP Baseline Details: Here we provide more details on our CLIP Baseline. ... A formal algorithm is described as Algorithm 1 and a visualization of this is shown as Figure 6. Algorithm 1 CLIP Baseline. Require: mask proposal network f_m, CLIP visual encoder f_v, CLIP text encoder f_t. Given an image I ∈ R^{H×W×3} and a list T containing C category names. E = f_t(T). M = f_m(I). for i = 1, 2, ..., N do: R_i = M_i ⊙ I; V_i = f_v(R_i); Y_i = softmax(E · V_i); end for. (A PyTorch sketch of this loop follows the table.) |
| Open Source Code | Yes | Project website: https://maskclip.github.io. |
| Open Datasets | Yes | In this part, we train our proposed MaskCLIP method using COCO (Lin et al., 2014) training data and test on other datasets (ADE20K (Zhou et al., 2019; 2017), PASCAL Context (Mottaghi et al., 2014), LVIS) under the open-vocabulary setting. |
| Dataset Splits | Yes | COCO: COCO (Lin et al., 2014) includes 133 classes, where 80 classes are things and 53 classes are stuff or background. There are 118k training images and 5k validation images. (...) ADE20K: ADE20K (Zhou et al., 2019; 2017) contains 20,210 images and annotations for training and 2,000 images and annotations for validation. (...) PASCAL Context: PASCAL Context (Mottaghi et al., 2014) contains 10,103 per-pixel annotations for images of PASCAL VOC 2010 (Everingham et al.), where 4,998 are for training and 5,105 for validation. (...) LVIS: LVIS (Gupta et al., 2019) contains 100,170 images for training and 19,809 images for validation. |
| Hardware Specification | Yes | The training takes around 3h on 8 Nvidia A5000 GPUs. (...) For feature extraction of 100 masks in a single image, the CLIP baseline takes ~3s on a single 3090 GPU while the MaskCLIP w/o RMA baseline only takes ~0.6s, which is ~5x faster. |
| Software Dependencies | No | The paper mentions using 'AdamW (Loshchilov & Hutter, 2019) as our optimizer' but does not provide version numbers for any software dependencies such as deep learning frameworks, programming languages, or other libraries. |
| Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2019) as our optimizer and the learning rate is set to 0.0001. We train our model on COCO training data for 10k iterations with a batch size of 8. (...) The loss function is L = λ_ce·L_ce + λ_dice·L_dice + λ_bce·L_bce, where L_ce is the loss for classification, and L_dice and L_bce are the losses for mask localization. In our experiments, we set λ_ce = 2, λ_dice = 5, λ_bce = 5. (A sketch of this weighted loss follows the table.) |
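
The Algorithm 1 pseudocode quoted in the table maps directly onto a short per-mask loop: encode the category names once, mask out each proposal region, embed it with the CLIP visual encoder, and classify it against the text embeddings. Below is a minimal PyTorch sketch of that loop; it assumes OpenAI's `clip` package, and `mask_proposal_net` and the ViT-B/32 backbone are illustrative stand-ins rather than the authors' released code.

```python
# Minimal sketch of Algorithm 1 (CLIP Baseline). Assumes OpenAI's `clip`
# package; `mask_proposal_net` and the ViT-B/32 backbone are hypothetical
# stand-ins for the paper's f_m and CLIP encoder choice.
import torch
import clip
from torchvision.transforms.functional import to_pil_image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_baseline(image, category_names, mask_proposal_net):
    """image: (H, W, 3) float tensor in [0, 1]. Returns (N, C) class scores."""
    with torch.no_grad():
        # E = f_t(T): encode the C category names once.
        tokens = clip.tokenize(category_names).to(device)
        E = model.encode_text(tokens).float()              # (C, D)
        E = E / E.norm(dim=-1, keepdim=True)

        # M = f_m(I): N binary mask proposals of shape (N, H, W).
        masks = mask_proposal_net(image)

        scores = []
        for i in range(masks.shape[0]):
            # R_i = M_i ⊙ I: keep only the pixels inside the i-th mask.
            region = image * masks[i].unsqueeze(-1)        # (H, W, 3)
            region = preprocess(to_pil_image(region.permute(2, 0, 1)))
            # V_i = f_v(R_i): CLIP visual embedding of the masked region.
            V_i = model.encode_image(region.unsqueeze(0).to(device)).float()
            V_i = V_i / V_i.norm(dim=-1, keepdim=True)
            # Y_i = softmax(E V_i): similarity to every category name.
            scores.append((E @ V_i.squeeze(0)).softmax(dim=0))
    return torch.stack(scores)                             # (N, C)
```

Running one CLIP forward pass per mask is exactly why the baseline is slow (~3s for 100 masks, per the Hardware Specification row); the paper's MaskCLIP avoids this by sharing the backbone computation across masks.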
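
Similarly, the training loss quoted in the Experiment Setup row, L = λ_ce·L_ce + λ_dice·L_dice + λ_bce·L_bce with λ_ce = 2 and λ_dice = λ_bce = 5, combines a classification term with two mask-localization terms. The sketch below shows one plausible reading using standard cross-entropy, Dice, and binary cross-entropy losses; the `dice_loss` helper is a common formulation, not necessarily the authors' exact implementation.

```python
# Sketch of the weighted loss L = λ_ce L_ce + λ_dice L_dice + λ_bce L_bce,
# assuming per-pixel mask logits and binary mask targets.
import torch
import torch.nn.functional as F

def dice_loss(mask_logits, mask_targets, eps=1.0):
    """Soft Dice loss over flattened per-mask predictions (common formulation)."""
    probs = mask_logits.sigmoid().flatten(1)               # (N, H*W)
    targets = mask_targets.flatten(1)                      # (N, H*W)
    num = 2 * (probs * targets).sum(-1)
    den = probs.sum(-1) + targets.sum(-1)
    return (1 - (num + eps) / (den + eps)).mean()

def total_loss(class_logits, class_targets, mask_logits, mask_targets,
               lambda_ce=2.0, lambda_dice=5.0, lambda_bce=5.0):
    """Combine the three terms with the weights reported in the paper."""
    l_ce = F.cross_entropy(class_logits, class_targets)    # classification
    l_dice = dice_loss(mask_logits, mask_targets)          # mask localization
    l_bce = F.binary_cross_entropy_with_logits(
        mask_logits, mask_targets)                         # mask localization
    return lambda_ce * l_ce + lambda_dice * l_dice + lambda_bce * l_bce
```

Note that the Dice and BCE terms operate on the same per-pixel mask logits, so the 5:5 weighting balances the two localization signals against the weight-2 classification term.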