DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model
Authors: Zhixiong Nan, Xianghong Li, Tao Xiang, Jifeng Dai
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper is motivated by an interesting phenomenon: the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance) when investigating the intermediate results from the beginning transformer decoder layer of Mask DINO (i.e., the SOTA model for joint detection and segmentation). This phenomenon inspires us to think about a question: will the performance imbalance at the beginning layer of the transformer decoder constrain the upper bound of the final performance? With this question in mind, we further conduct qualitative and quantitative pre-experiments, which validate the negative impact of the detection-segmentation imbalance issue on the model performance. To address this issue, this paper proposes the DI-MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is implemented by configuring our proposed De-Imbalance (DI) module and Balance-Aware Tokens Optimization (BATO) module to Mask DINO. DI is responsible for generating the balance-aware query, and BATO uses the balance-aware query to guide the optimization of the initial feature tokens. The balance-aware query and optimized feature tokens are respectively taken as the Query and Key&Value of the transformer decoder to perform joint object detection and instance segmentation (see the illustrative decoder-input sketch after the table). DI-MaskDINO outperforms existing joint object detection and instance segmentation models on COCO and BDD100K benchmarks, achieving +1.2 AP^box and +0.9 AP^mask improvements compared to the SOTA joint detection and segmentation model Mask DINO. In addition, DI-MaskDINO also obtains a +1.0 AP^box improvement compared to the SOTA object detection model DINO and a +3.0 AP^mask improvement compared to the SOTA segmentation model Mask2Former. |
| Researcher Affiliation | Academia | ¹ College of Computer Science, Chongqing University, Chongqing, China. ² Department of Electronic Engineering, Tsinghua University, Beijing, China. |
| Pseudocode | No | The paper describes the proposed method and modules using descriptive text and a system diagram (Figure 2), but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | This paper presents a detailed description of our proposed DI-MaskDINO model in Section 3, along with comprehensive implementation details provided in Appendix B. We will release the source code of our model at the following URL: https://github.com/CQU-ADHRI-Lab/DI-MaskDINO. |
| Open Datasets | Yes | We conduct extensive experiments on COCO [26] and BDD100K [49] datasets |
| Dataset Splits | Yes | We use the COCO train2017 split (118k images) for training and the val2017 split (5k images) for validation. In addition, considering autonomous driving is a typical and practical application of object detection and instance segmentation, the experiments are also conducted on the BDD100K [49] dataset, which is composed of 10k high-quality instance mask and bounding box annotations for 8 classes. The training set and validation set are divided following the standard in [49]. |
| Hardware Specification | Yes | NVIDIA RTX 3090 GPUs are used when the backbone is ResNet50. Due to the large memory requirement of Swin-L, NVIDIA RTX A6000 GPUs are used when the backbone is Swin-L. |
| Software Dependencies | No | The paper states: "We implement DI-MaskDINO based on Detectron2 [48], using AdamW [31] optimizer with a step learning rate schedule." While specific software components are named, their version numbers are not provided. |
| Experiment Setup | Yes | The initial learning rate is set as 0.0001. Following Mask DINO, DI-MaskDINO is trained for 50 epochs on COCO with a batch size of 16, decaying the initial learning rate at fractions 0.9 and 0.95 of the total training iterations by a factor of 0.1. For BDD100K, following the setting in [22], we train our model for 68 epochs with a batch size of 8, and the learning rate drops at the 50th epoch. The number of transformer encoder and decoder layers is 6. The token numbers of T_s1 and T_s2 are 600 and 300, respectively. Unless otherwise specified, the feature channels in both encoder and decoder are set to 256, and the hidden dimension of the FFN is set to 2048. The mask network and box network in BATO are both three-layer MLP networks. We use the same loss function as Mask DINO (i.e., L1 loss and GIoU loss for box loss, focal loss for classification loss, and cross-entropy loss and dice loss for mask loss). See the training-schedule sketch after the table. |
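
The abstract's decoder-input flow (the balance-aware query from DI used as Query, the BATO-optimized feature tokens used as Key&Value of the transformer decoder) can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the `DecoderInputSketch` class and the single-linear-layer stand-ins for DI and BATO are assumptions; only the 256 feature channels and 2048 FFN hidden dimension follow the reported settings, and the token/query counts are illustrative.

```python
import torch
import torch.nn as nn

class DecoderInputSketch(nn.Module):
    """Toy stand-in for the DI + BATO + decoder wiring; not the released model."""
    def __init__(self, dim=256, num_queries=300, num_heads=8):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, dim)
        self.de_imbalance = nn.Linear(dim, dim)   # placeholder for the DI module
        self.bato = nn.Linear(dim, dim)           # placeholder for the BATO module
        self.decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2048, batch_first=True)

    def forward(self, feature_tokens):            # feature_tokens: (B, N, dim)
        b = feature_tokens.size(0)
        # DI produces the balance-aware query (here: a learned embedding passed
        # through a placeholder layer).
        query = self.de_imbalance(self.query_embed.weight).unsqueeze(0).expand(b, -1, -1)
        # BATO optimizes the initial feature tokens (query-guided optimization omitted).
        tokens = self.bato(feature_tokens)
        # The query attends to the optimized tokens, which serve as Key & Value.
        return self.decoder_layer(tgt=query, memory=tokens)

out = DecoderInputSketch()(torch.randn(2, 600, 256))
print(out.shape)  # torch.Size([2, 300, 256])
```

In the paper, DI and BATO are dedicated modules rather than single linear layers; the sketch only mirrors how their outputs are routed into the decoder as Query and Key&Value.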
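
The optimizer, step learning-rate schedule, and composite loss reported in the experiment setup can likewise be mirrored in a short sketch. `build_optimizer_and_scheduler` and `total_loss` are hypothetical helpers, and the loss weights are placeholders: the excerpt names the loss terms but not their weights.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer_and_scheduler(model, total_iters):
    # Initial learning rate 1e-4; decayed by 0.1 at fractions 0.9 and 0.95 of the
    # total training iterations (the reported COCO schedule).
    optimizer = AdamW(model.parameters(), lr=1e-4)
    scheduler = MultiStepLR(
        optimizer,
        milestones=[int(0.9 * total_iters), int(0.95 * total_iters)],
        gamma=0.1,
    )
    return optimizer, scheduler

# Mask DINO-style composite loss: L1 + GIoU for boxes, focal for classification,
# cross-entropy + dice for masks. Weights below are placeholders, not from the paper.
LOSS_WEIGHTS = {"l1": 1.0, "giou": 1.0, "focal": 1.0, "ce": 1.0, "dice": 1.0}

def total_loss(loss_terms):
    """loss_terms maps each loss name above to a scalar tensor."""
    return sum(LOSS_WEIGHTS[name] * term for name, term in loss_terms.items())

# Example wiring with a dummy model; total_iters is an arbitrary example value.
opt, sched = build_optimizer_and_scheduler(nn.Linear(256, 256), total_iters=100_000)
```

With per-iteration milestones as above, `sched.step()` would be called once per training iteration rather than per epoch.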