Towards Unified Multi-granularity Text Detection with Interactive Attention

Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.
Researcher Affiliation | Industry | Baidu, Beijing, China.
Pseudocode | No | The paper includes network diagrams (e.g., Figure 2) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials) for the source code of the described methodology.
Open Datasets | Yes | For word detection, we used the ICDAR2015 (Karatzas et al., 2015) and Total-Text (Ch'ng et al., 2020) datasets; for line detection, CTW1500 (Liu et al., 2019) and MSRA-TD500 (Yao et al., 2012) were employed; M6Doc (Cheng et al., 2023) facilitated our document layout analysis; and DIW (Ma et al., 2022) was the choice for page detection.
Dataset Splits | No | The paper mentions using various datasets for training and evaluation (e.g., ICDAR2015, Total-Text, M6Doc, DIW) and refers to 'benchmark test sets'. However, it does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit references to predefined splits for all datasets used).
Hardware Specification | Yes | Our multi-granularity text detection framework was implemented on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using specific models such as "Swin Transformer Large (Swin-L) pretrained on ImageNet-22K" and "DINO (Zhang et al., 2022)", and states that it follows "SAM (Kirillov et al., 2023)". However, it does not specify versions for underlying software dependencies such as Python, PyTorch, or TensorFlow, which are necessary for full reproducibility.
Experiment Setup | Yes | During the model training phase, the batch size is set to 8 (1 per GPU). The query number Nq of each group (Sec 3.2) is set to 900. We train our full DAT model using public datasets of all granularities for a total of 120 epochs. The base learning rate is 1e-4 and is reduced to 1e-5 at the 66th epoch and 1e-6 at the 99th epoch. For our proposed mixed-granularity training (Sec 3.2), the weight of the L1 loss is 5.0, the weight of the GIoU loss is 2.0, and the weight of the focal loss is 1.0. We choose AdamW with a weight decay of 1e-4 as our optimizer. The number of both encoder and decoder layers is set to 6.
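
For readers who want to mirror this setup, below is a minimal sketch of how the quoted hyperparameters might be expressed in PyTorch. The configuration names, the `build_optimizer_and_scheduler` helper, and the use of `MultiStepLR` are illustrative assumptions, not the authors' released code.

```python
import torch

# Hypothetical summary of the training hyperparameters quoted above
# (names and structure are assumed; the paper releases no code).
TRAIN_CONFIG = {
    "num_gpus": 8,                      # NVIDIA A100, 1 image per GPU
    "batch_size": 8,
    "num_queries_per_group": 900,       # Nq in Sec. 3.2
    "epochs": 120,
    "base_lr": 1e-4,
    "lr_drop_epochs": [66, 99],         # 1e-4 -> 1e-5 -> 1e-6
    "loss_weights": {"l1": 5.0, "giou": 2.0, "focal": 1.0},
    "weight_decay": 1e-4,
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
}

def build_optimizer_and_scheduler(model: torch.nn.Module):
    """Assumed helper: AdamW plus stepwise LR drops at epochs 66 and 99."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=TRAIN_CONFIG["base_lr"],
        weight_decay=TRAIN_CONFIG["weight_decay"],
    )
    # MultiStepLR with gamma=0.1 reproduces the 1e-4 -> 1e-5 -> 1e-6 schedule.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=TRAIN_CONFIG["lr_drop_epochs"], gamma=0.1
    )
    return optimizer, scheduler
```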