DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model
Authors: Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, David Ross
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train DaTaSeg on ADE semantic, COCO panoptic, and Objects365 detection datasets. DaTaSeg improves performance on all datasets, especially small-scale datasets, achieving 54.0 mIoU on ADE semantic and 53.5 PQ on COCO panoptic. Experiments show DaTaSeg scales with the number of training datasets and enables open-vocabulary segmentation through direct transfer. |
| Researcher Affiliation | Industry | Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, David Ross (Google Research) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "In addition, we annotate an Objects365 instance segmentation set of 1,000 images and release it as a public evaluation benchmark on https://laoreja.github.io/dataseg." This refers to a dataset release, not the open-sourcing of the DaTaSeg methodology code. |
| Open Datasets | Yes | We train and evaluate DaTaSeg on COCO panoptic [29] and ADE20k semantic [75] using mask supervision, as well as Objects365-v2 [56] detection datasets using bounding box weak supervision. ... We annotate an Objects365 instance segmentation set of 1,000 images and release it as a public evaluation benchmark on https://laoreja.github.io/dataseg. |
| Dataset Splits | Yes | COCO panoptic is the most popular panoptic segmentation benchmark with 118,287 training images and 5,000 validation images. COCO has 80 thing categories and 53 stuff categories. ADE20k semantic is one of the most widely used semantic segmentation benchmarks with 150 categories, 20,210 training images, and 2,000 validation images. |
| Hardware Specification | Yes | For ResNet50, we train on 64 TPU v4 chips; for ViTDet backbones, we train on 128 TPU v4 chips. All evaluations are conducted on 4 V100 GPUs with a batch size of 8. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and CLIP-L/14 as a pretrained text encoder, but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We randomly scale the input image in the range of [0.1, 2.0] and then pad or crop it to 1024×1024. For the ADE20k dataset, since the image size is smaller than other datasets, we use a scaling range of [0.5, 2.0]. We use the AdamW optimizer [46] with a weight decay of 0.05. We clip the gradients with a max norm of 0.1. The weight for the background class is set to 0.05 in L_ce. The matching cost and loss weight settings for Eqns. 4 and 5 in the main paper are shown in Table 13. We use a dataset sampling ratio of 1:4:4 for ADE semantic, COCO panoptic, and Objects365 detection. We adopt a different learning rate multiplier for each dataset: We multiply the learning rate on ADE semantic, COCO panoptic, and Objects365 detection by 3, 5, 2, respectively. Since the Objects365 detection dataset has a large vocabulary with an imbalanced distribution, we apply repeat factor sampling with a frequency threshold t = 0.01 [18]. On ResNet50 backbones, we use a batch size of 384 and train 500k iterations, with a learning rate of 3e-5. We adopt the step learning rate schedule: We multiply the learning rate by 0.1 at the 0.9 and 0.95 fractions of the total training iterations. On the ViTDet-B backbones, we train 600k iterations with a learning rate of 6e-5. On ViTDet-L, we use a batch size of 256 and train 540.5k iterations with a learning rate of 4e-5. |
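
The setup row above fully specifies the optimization schedule in prose. As a quick illustration, here is a minimal Python sketch of the step learning-rate schedule with the per-dataset multipliers and the 1:4:4 sampling weights, assuming the ResNet50 configuration (3e-5 base LR, 500k iterations). This is not the authors' code, which is unreleased; all names here are hypothetical.

```python
# Hedged sketch of the training schedule described in the setup row.
# Not the authors' implementation; names and structure are hypothetical.

TOTAL_ITERS = 500_000        # ResNet50 run
BASE_LR = 3e-5               # ResNet50 run
DECAY_POINTS = (0.9, 0.95)   # fractions of total iterations
DECAY_FACTOR = 0.1

# Per-dataset learning-rate multipliers (ADE : COCO : Objects365 = 3 : 5 : 2).
LR_MULTIPLIERS = {"ade_semantic": 3, "coco_panoptic": 5, "o365_detection": 2}

# Dataset sampling ratio 1:4:4 (ADE : COCO : Objects365).
SAMPLING_WEIGHTS = {"ade_semantic": 1, "coco_panoptic": 4, "o365_detection": 4}


def step_lr(iteration: int, dataset: str) -> float:
    """Base LR, multiplied by 0.1 at the 0.9 and 0.95 fractions of
    training, then scaled by the dataset-specific multiplier."""
    lr = BASE_LR
    for frac in DECAY_POINTS:
        if iteration >= frac * TOTAL_ITERS:
            lr *= DECAY_FACTOR
    return lr * LR_MULTIPLIERS[dataset]


if __name__ == "__main__":
    # LR before the first decay, after the first, and after the second.
    for it in (0, 450_000, 475_000):
        print(it, {d: step_lr(it, d) for d in LR_MULTIPLIERS})
```

Note that the paper's additional tricks (repeat factor sampling with t = 0.01 on Objects365, gradient clipping at max norm 0.1, weight decay 0.05) would sit alongside this schedule in a full training loop; they are omitted here for brevity.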