DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Authors: Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, Dacheng Tao

AAAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments prove the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks. |
| Researcher Affiliation | Collaboration | Maoyuan Ye¹, Jing Zhang², Shanshan Zhao³, Juhua Liu¹*, Bo Du⁴*, Dacheng Tao³,²: ¹ Research Center for Graphic Communication, Printing and Packaging, Institute of Artificial Intelligence, Wuhan University; ² The University of Sydney; ³ JD Explore Academy; ⁴ National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University |
| Pseudocode | No | The paper describes its model architecture and components but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and the Inverse-Text test set are available at https://github.com/ymy-k/DPText-DETR. |
| Open Datasets | Yes | SynthText 150K (Liu et al. 2020b) is a synthesized dataset for arbitrary-shape scene text, containing 94,723 images with multi-oriented text and 54,327 images with curved text. Total-Text (Ch'ng, Chan, and Liu 2020) consists of 1,255 training images and 300 test images. CTW1500 (Liu et al. 2019) contains 1,000 training images and 500 test images. ICDAR19 ArT (Chng et al. 2019)... It contains 5,603 training images and 4,563 test images. |
| Dataset Splits | No | The paper provides training and test set sizes for the datasets used (Total-Text, CTW1500, ICDAR19 ArT), but it does not report a separate validation split or the counts for one. |
| Hardware Specification | Yes | Models are trained with 4 NVIDIA A100 (40GB) GPUs and tested with 1 GPU. |
| Software Dependencies | No | The paper mentions the AdamW optimizer and implies the use of a deep learning framework, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, or CUDA). |
| Experiment Setup | Yes | The number of both encoder and decoder layers is set to 6. The composite query number K is 100 and the default control-point number N is 16. The batch size is set to 8. The initial learning rate (lr) is 1 × 10⁻⁴ and is decayed to 1 × 10⁻⁵ at 280k iterations. We fine-tune on Total-Text for 20k iterations, with a 5 × 10⁻⁵ lr which is divided by 10 at 16k. We use the AdamW optimizer (Loshchilov and Hutter 2019) with β₁ = 0.9, β₂ = 0.999 and a weight decay of 10⁻⁴. Data augmentation strategies such as random crop, random blur, brightness adjustment, and color change are applied. We adopt a multi-scale training strategy with the shortest edge ranging from 480 to 832 and the longest edge kept within 1600. |
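The step learning-rate schedules and the multi-scale resizing rule quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration under the paper's stated numbers; the helper names (`pretrain_lr`, `finetune_lr`, `multiscale_size`) are hypothetical and are not from the authors' released code.

```python
import random

def pretrain_lr(iteration):
    """Step schedule from the paper: 1e-4, decayed to 1e-5 at 280k iterations."""
    return 1e-4 if iteration < 280_000 else 1e-5

def finetune_lr(iteration):
    """Total-Text fine-tuning: 5e-5 for 20k iterations, divided by 10 at 16k."""
    return 5e-5 if iteration < 16_000 else 5e-6

def multiscale_size(h, w, short_min=480, short_max=832, long_cap=1600):
    """Multi-scale training resize: sample a target shortest edge in
    [480, 832], then rescale while keeping the longest edge within 1600
    and preserving the aspect ratio."""
    target_short = random.randint(short_min, short_max)
    scale = target_short / min(h, w)
    if max(h, w) * scale > long_cap:
        scale = long_cap / max(h, w)
    return round(h * scale), round(w * scale)
```

For example, a 600×800 image would be rescaled so its shorter edge lands in [480, 832], with the 1600-pixel cap never binding at that aspect ratio; a very wide image would instead be shrunk until its longest edge fits within 1600.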