X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion
Authors: Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, Weiming Zhang, Nenghai Yu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes, and even more significant gains of +6.8 box AP and +6.5 mask AP on long-tail classes. Our code and models are available at https://github.com/yoctta/XPaste. We perform extensive experiments to validate the superiority of X-Paste. |
| Researcher Affiliation | Collaboration | University of Science and Technology of China; Microsoft. Correspondence to: Jianmin Bao <jianmin.bao@microsoft.com>, Wenbo Zhou <welbeckz@ustc.edu.cn>. |
| Pseudocode | No | The paper describes its methodology in narrative text and block diagrams (Figure 1) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are available at https://github.com/yoctta/XPaste. |
| Open Datasets | Yes | Datasets. We conduct experiments on object detection and instance segmentation on the LVIS (Gupta et al., 2019) and MS-COCO (Lin et al., 2014) datasets. |
| Dataset Splits | Yes | LVIS dataset contains 100k training images, and 20k validation images. It has 1203 categories... MS-COCO dataset contains 118K training, 5K validation, and 20K test-dev images. We use the official split for training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or detailed specifications of the machines used for running experiments. |
| Software Dependencies | No | The paper mentions software frameworks and models such as CenterNet2, Detectron2, Stable Diffusion V1.4, and the CLIP model, but does not provide specific version numbers for the key software dependencies required for reproduction (e.g., Detectron2 version, PyTorch version). |
| Experiment Setup | Yes | The training configurations are set as follows: training resolution is set to 640, the batch size is 32, and a 4× schedule (48 epochs) is used. ... For Stable Diffusion, the number of diffusion steps is set to 200 with the classifier-free guidance scale set to 5.0. ... For Instance Filtering, we set the CLIP threshold to 0.21 to filter all the obtained instances. ... When a pasted instance occludes an object in the background image, we remove fully occluded objects and update mask and bounding box annotations accordingly. ... The number of instances pasted to each background image is set to 20 for training. |
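The Experiment Setup row quotes concrete generation and filtering settings. Below is a minimal sketch of how those settings might be wired together, assuming the Hugging Face diffusers and transformers APIs; the prompt template, checkpoint names, and the cosine-similarity scoring are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: generate candidate instances with Stable Diffusion V1.4 and
# filter them by CLIP image-text similarity, using the settings quoted above
# (200 diffusion steps, classifier-free guidance 5.0, CLIP threshold 0.21).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

sd = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

CLIP_THRESHOLD = 0.21  # instance-filtering threshold quoted from the paper

def generate_and_filter(category: str, n: int = 8):
    """Generate n images for a category prompt and keep those whose
    CLIP similarity to the prompt exceeds the threshold."""
    prompt = f"a photo of a single {category}"  # prompt template is an assumption
    images = sd(
        [prompt] * n,
        num_inference_steps=200,  # diffusion steps from the paper
        guidance_scale=5.0,       # classifier-free guidance scale from the paper
    ).images

    inputs = clip_proc(text=[prompt], images=images,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(-1)  # cosine similarity per image
    return [img for img, s in zip(images, scores.tolist()) if s > CLIP_THRESHOLD]
```

In practice the generated images would still need foreground segmentation before pasting; this sketch only covers the generation and CLIP-filtering stages named in the setup description.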
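The same row also describes the paste step: up to 20 instances are composited per background image, fully occluded objects are removed, and masks and boxes are updated. A minimal sketch of that bookkeeping is below; the data structures are illustrative assumptions, not the repository's actual ones.

```python
# Hedged sketch of the copy-paste step: composite up to 20 instances onto a
# background, shrink occluded masks, drop fully occluded annotations, and
# refresh bounding boxes, as described in the setup quoted above.
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Tight [x0, y0, x1, y1] box around a non-empty boolean mask."""
    ys, xs = np.nonzero(mask)
    return [int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1]

def paste_instances(image, annotations, instances, max_paste: int = 20):
    """image: HxWx3 array; annotations: list of {'mask': HxW bool, ...};
    instances: (rgb_patch, alpha_mask) pairs already placed on an HxW canvas."""
    image = image.copy()
    for patch, alpha in instances[:max_paste]:
        # Paste the new instance on top of everything placed so far.
        image[alpha] = patch[alpha]
        # Existing objects (including earlier pastes) lose the occluded pixels.
        for ann in annotations:
            ann["mask"] &= ~alpha
        annotations.append({"mask": alpha.copy()})
    # Remove fully occluded objects and refresh bounding boxes.
    annotations = [a for a in annotations if a["mask"].any()]
    for a in annotations:
        a["bbox"] = mask_to_box(a["mask"])
    return image, annotations
```

Because each new instance is pasted on top and subtracted from every existing mask, the final annotations reflect visible regions only, which is what makes the occlusion-removal rule in the quoted setup well defined.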