Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation

Authors: Yiming Cui, Linjie Yang, Haichao Yu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show the superior performance of our approach combined with a wide range of DETR-based models on MS COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016) and YouTube-VIS (Yang et al., 2019b) benchmarks with multiple tasks, including object detection, instance segmentation, and panoptic segmentation."
Researcher Affiliation | Collaboration | "1 Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA; 2 ByteDance Inc., San Jose, USA."
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code-release statement, or code in supplementary materials) for the methodology it describes.
Open Datasets | Yes | "For the object detection task, we use MS COCO benchmark (Lin et al., 2014) for evaluation, which contains 118,287 images for training and 5,000 for validation."
Dataset Splits | Yes | "For the object detection task, we use MS COCO benchmark (Lin et al., 2014) for evaluation, which contains 118,287 images for training and 5,000 for validation."
Hardware Specification | Yes | "The training time is based on 8 NVIDIA A100 GPUs and the inference FPS is tested on a single TITAN RTX GPU."
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | "The query ratio r used to generate the combination coefficients is set to 4 by default. β is set to 1. θ is implemented as a two-layer MLP with ReLU as the nonlinear activation. The output size of its first layer is 512, and that of the second layer is the length of W_D in the corresponding models. For detection models, we use 300 modulated queries and 1200 basic queries if not specified otherwise."
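To make the reported setup concrete, the sketch below shows one plausible reading of it: a two-layer MLP θ (hidden size 512, ReLU) produces combination coefficients that mix 1200 basic queries into 1200 / r = 300 modulated queries. The conditioning input, the embedding dimension of 256, the coefficient output length, and the consecutive grouping of basic queries are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes reported in the paper: r = 4, 1200 basic queries, 300 modulated
# queries. The embedding dimension d = 256 is an assumption.
r = 4
n_basic, d = 1200, 256
n_mod = n_basic // r  # 300

def relu(x):
    return np.maximum(x, 0.0)

# theta: two-layer MLP with ReLU, hidden size 512, as reported. Its input
# (a d-dim conditioning vector) and output length are illustrative guesses.
W1 = rng.standard_normal((d, 512)) * 0.02
W2 = rng.standard_normal((512, n_mod * r)) * 0.02

def theta(cond):
    """Map a conditioning vector to one coefficient set per modulated query."""
    coeff = relu(cond @ W1) @ W2      # shape (n_mod * r,)
    return coeff.reshape(n_mod, r)

basic_queries = rng.standard_normal((n_basic, d))
coeff = theta(rng.standard_normal(d))

# Each modulated query is a weighted combination of r basic queries
# (consecutive groups of r here; the true grouping is an assumption).
groups = basic_queries.reshape(n_mod, r, d)
modulated = np.einsum('mr,mrd->md', coeff, groups)
```

The key design point this sketch captures is that the modulated queries are not learned independently: they are dynamic combinations of a larger pool of basic queries, with the mixing coefficients predicted by the small MLP θ.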