Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation
Authors: Yiming Cui, Linjie Yang, Haichao Yu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the superior performance of our approach combined with a wide range of DETR-based models on MS COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016) and YouTube-VIS (Yang et al., 2019b) benchmarks with multiple tasks, including object detection, instance segmentation, and panoptic segmentation. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA; 2ByteDance Inc., San Jose, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | Yes | For the object detection task, we use MS COCO benchmark (Lin et al., 2014) for evaluation, which contains 118,287 images for training and 5,000 for validation. |
| Dataset Splits | Yes | For the object detection task, we use MS COCO benchmark (Lin et al., 2014) for evaluation, which contains 118,287 images for training and 5,000 for validation. |
| Hardware Specification | Yes | The training time is based on 8 NVIDIA A100 GPUs and the inference FPS is tested on a single TITAN RTX GPU. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | The query ratio r used to generate the combination coefficients is set to 4 by default. β is set to 1. θ is implemented as a two-layer MLP with ReLU as nonlinear activations. The output size of its first layer is 512, and that of the second layer is the length of W_D in corresponding models. For detection models, we use 300 modulated queries and 1200 basic queries if not specified otherwise. (See the sketch below the table.) |
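
The Experiment Setup entry describes the coefficient-generating network θ concretely enough to sketch. Below is a minimal, hypothetical PyTorch sketch of such a two-layer MLP; the class name, the 256-dimensional input feature, and the choice of conditioning signal are assumptions not taken from the paper, while the 512-unit first layer, ReLU activation, query ratio r = 4, and the 300 modulated / 1200 basic query counts follow the quoted setup.

```python
import torch
import torch.nn as nn


class CoefficientMLP(nn.Module):
    """Hypothetical sketch of the theta network from the Experiment Setup row:
    a two-layer MLP with ReLU whose output length matches the number of basic
    queries (the length of W_D in the corresponding model). The input feature
    dimension and conditioning signal are assumptions, not the authors' code."""

    def __init__(self, in_dim: int, num_basic_queries: int = 1200, hidden_dim: int = 512):
        super().__init__()
        # First layer outputs 512 units; second layer outputs one combination
        # coefficient per basic query.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_basic_queries),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)


# Illustrative shapes only: with the default query ratio r = 4,
# 300 modulated queries are combined from 1200 basic queries.
num_modulated, r = 300, 4
num_basic = num_modulated * r  # 1200
theta = CoefficientMLP(in_dim=256, num_basic_queries=num_basic)
coeffs = theta(torch.randn(num_modulated, 256))  # shape: (300, 1200)
```

Under this reading, each of the 300 modulated queries is a learned combination over the 1200 basic queries, which is why the second layer's width is tied to the length of W_D in the corresponding model.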