Towards Unified Multi-granularity Text Detection with Interactive Attention
Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks. |
| Researcher Affiliation | Industry | 1Baidu, Beijing, China. |
| Pseudocode | No | The paper includes network diagrams (e.g., Figure 2) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., a specific repository link, an explicit code release statement, or code in supplementary materials) for the source code of the described methodology. |
| Open Datasets | Yes | For word detection, we used the ICDAR2015 (Karatzas et al., 2015) and Total-Text (Ch'ng et al., 2020) datasets; for line detection, CTW1500 (Liu et al., 2019) and MSRA-TD500 (Yao et al., 2012) were employed; M6Doc (Cheng et al., 2023) facilitated our document layout analysis; and DIW (Ma et al., 2022) was the choice for page detection. |
| Dataset Splits | No | The paper mentions using various datasets for training and evaluation (e.g., ICDAR2015, Totaltext, M6Doc, DIW), and refers to 'benchmark test sets'. However, it does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit references to predefined splits for all datasets used). |
| Hardware Specification | Yes | Our multi-granularity text detection framework was implemented on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like "Swin Transformer Large (Swin-L) pretrained on ImageNet-22K", "DINO (Zhang et al., 2022)", and following "SAM (Kirillov et al., 2023)". However, it does not specify versions for underlying software dependencies such as Python, PyTorch, or TensorFlow, which are necessary for full reproducibility. |
| Experiment Setup | Yes | During the model training phase, the batch size is set to 8 (1 per single GPU). The query number Nq of each group (Sec 3.2) is set to 900. We train our full DAT model using public datasets of all granularities, for a total of 120 epochs. The base learning rate is 1e-4 and is reduced to 1e-5 at the 66th epoch and 1e-6 at the 99th epoch. For our proposed mixed-granularity training (Sec 3.2), the weight of the L1 loss is 5.0, the weight of the GIoU loss is 2.0, and the weight of the focal loss is 1.0. We choose AdamW with a weight decay parameter of 1e-4 as our optimizer. The number of both encoder and decoder layers is set to 6. (A configuration sketch of these settings follows the table.) |
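The reported setup maps directly onto a standard optimizer/scheduler configuration. The sketch below is a minimal, hedged illustration assuming a PyTorch implementation; the stand-in module, variable names, and loss-weight keys are illustrative and are not the authors' actual identifiers.

```python
import torch
from torch import nn

# Stand-in module; in practice this would be the full DAT network.
model = nn.Linear(256, 4)

# AdamW optimizer with the reported base learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Learning rate drops from 1e-4 to 1e-5 at epoch 66 and to 1e-6 at epoch 99,
# i.e. a 10x decay at each milestone over 120 total epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[66, 99], gamma=0.1
)

# Loss weights for the mixed-granularity training objective (keys are illustrative).
loss_weights = {"l1": 5.0, "giou": 2.0, "focal": 1.0}

# Other reported hyperparameters.
batch_size_per_gpu = 1          # 8 GPUs -> effective batch size of 8
num_queries_per_group = 900     # Nq per granularity group (Sec 3.2)
num_encoder_layers = num_decoder_layers = 6
total_epochs = 120

for epoch in range(total_epochs):
    # ... one training epoch over the mixed-granularity data would run here ...
    scheduler.step()
```

Under these assumptions, the schedule and loss weighting can be verified against the quoted text; any reimplementation would still need the paper's model architecture and data pipeline, which are not released.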