All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting

Authors: Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, Yongchao Xu, Mengchao He, Yongpan Wang, Wenyu Liu (pp. 12160-12167)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on three challenging datasets, including ICDAR2015, Total-Text and COCO-Text, demonstrate that the proposed method consistently surpasses the state-of-the-art in both scene text detection and end-to-end text recognition tasks.
Researcher Affiliation | Collaboration | 1Huazhong University of Science and Technology, 2Alibaba Group; {wanghao4659, lupu, huizhang0110, yangmingkun, xbai, yongchaoxu, liuwy}@hust.edu.cn, mengchao.hmc@alibaba-inc.com, yongpan@taobao.com
Pseudocode | Yes | Algorithm 1: Generate Target Boundary Points
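The pseudocode named in this row (Algorithm 1, "Generate Target Boundary Points") produces a fixed number of target points along each text boundary. As a rough illustration of the kind of resampling such an algorithm needs, the helper below samples k equally spaced points (by arc length) along a boundary polyline; the function name and interface are assumptions for this sketch, not the paper's exact procedure:

```python
import math

def sample_points(polyline, k):
    """Sample k points at equal arc-length intervals along a polyline.

    polyline: list of (x, y) vertices; k: number of output points (k >= 2).
    Illustrative sketch only, not the paper's Algorithm 1.
    """
    # Cumulative arc length at each vertex.
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]

    pts = []
    for i in range(k):
        target = total * i / (k - 1)
        # Find the segment containing the target arc length.
        j = 1
        while j < len(dists) - 1 and dists[j] < target:
            j += 1
        seg = dists[j] - dists[j - 1]
        t = 0.0 if seg == 0 else (target - dists[j - 1]) / seg
        (x0, y0), (x1, y1) = polyline[j - 1], polyline[j]
        # Linear interpolation within the segment.
        pts.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return pts
```

For example, resampling the straight boundary [(0, 0), (10, 0)] with k=5 yields points spaced 2.5 units apart, including both endpoints.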
Open Source Code | No | The paper does not include an explicit statement about open-sourcing the code or provide a link to a code repository.
Open Datasets | Yes | To confirm the effectiveness of the proposed method on arbitrary-shaped text spotting, we conduct exhaustive experiments and compare with other state-of-the-art methods on four popular benchmarks: a horizontal text set ICDAR2013 (Karatzas et al. 2013), two oriented text sets ICDAR2015 (Karatzas et al. 2015) and COCO-Text (Veit et al. 2016), and a curved text set Total-Text (Ch'ng and Liu 2019). The details about these datasets are as follows. SynthText (Gupta, Vedaldi, and Zisserman 2016) has about 800,000 images, which are generated via a synthesizing engine.
Dataset Splits | Yes | Total-Text contains 1,255 training images and 300 test images. All images are annotated with word-level polygons and transcriptions. ICDAR2015 focuses on multi-oriented scene text detection and recognition in natural images. There are 1,000 training images and 500 test images. Word-level quadrangles and transcriptions are given for each image. ICDAR2013 is a dataset which focuses on horizontal scene text detection and recognition in natural images. The dataset consists of 229 images in the training set and 233 images in the test set.
Hardware Specification | Yes | We implement our method in PyTorch and conduct all experiments on a regular workstation with Nvidia Titan Xp GPUs.
Software Dependencies | No | The paper mentions using PyTorch for implementation, but does not specify a version number or list other software dependencies with their versions.
Experiment Setup | Yes | During pretraining, the mini-batch size is set to 16, and the longer sides of input images are resized to 800 while keeping the aspect ratio. The maximum number of proposals per image on the recognition branch is set to 16. In the finetuning stage, for data augmentation, we randomly crop a patch whose edges range from 210 to 1100 while keeping all text instances uncropped, and resize the patch to (640, 640). Finally, the resized patch is randomly rotated 90° with a probability of 0.2. We collect the training images from ICDAR2013, ICDAR2015, and Total-Text to finetune the model with a mini-batch size of 16. We optimize our model using SGD with a weight decay of 0.0001 and momentum of 0.9. We pretrain the model for 270k iterations with an initial learning rate of 0.01, decayed to a tenth at the 100k and 200k iterations. In the finetuning stage, the initial learning rate is set to 0.001 and then decreased to 0.0001 and 0.00001 at the 80k and 120k iterations. The finetuning process is terminated at the 140k iteration.
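The learning-rate schedule quoted in this row (step decay to a tenth at fixed iterations) can be sketched in plain Python; in PyTorch, `torch.optim.lr_scheduler.MultiStepLR` implements the same behavior. The helper name is an assumption; the milestone and rate values come from the quote above:

```python
def lr_at(iteration, base_lr, milestones, gamma=0.1):
    """Step-decay schedule: multiply base_lr by gamma at each passed milestone."""
    decays = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** decays

# Pretraining: initial lr 0.01, decayed to a tenth at 100k and 200k iterations.
pretrain_lrs = [lr_at(i, 0.01, [100_000, 200_000]) for i in (0, 150_000, 250_000)]

# Finetuning: initial lr 0.001, decayed at 80k and 120k iterations.
finetune_lrs = [lr_at(i, 0.001, [80_000, 120_000]) for i in (0, 100_000, 130_000)]
```

With these milestones, the pretraining rate passes through 0.01, 0.001, and 0.0001 over the 270k iterations, and the finetuning rate through 0.001, 0.0001, and 0.00001 over the 140k iterations, matching the schedule described in the paper.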