TextScanner: Reading Characters in Order for Robust Scene Text Recognition

Authors: Zhaoyi Wan, Minghang He, Haoran Chen, Xiang Bai, Cong Yao

AAAI 2020, pp. 12120-12127 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments on standard benchmark datasets demonstrate that TextScanner outperforms state-of-the-art methods.
Researcher Affiliation | Collaboration | Megvii; Huazhong University of Science and Technology; Beijing Institute of Technology
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the described methodology.
Open Datasets | Yes | ICDAR 2013 (IC13) (Karatzas et al. 2013): the recognition task provides 288 scene images with annotations, from which 1015 word images are cropped; the dataset also provides character-level bounding boxes. ICDAR 2015 (IC15) (Karatzas et al. 2015): consists of 1000 training and 500 test images with word-level quadrangle annotations. IIIT 5K-Words (IIIT) (Mishra, Alahari, and Jawahar 2012): contains 5K word images for scene text recognition. Street View Text (SVT) (Wang, Babenko, and Belongie 2011): has 350 images; only word-level annotations are provided. SVT-Perspective (SVTP) (Phan et al. 2013): contains 639 cropped images for testing, many of which are heavily distorted. CUTE80 (CT) (Risnumawan et al. 2014): taken in natural scenes; consists of 80 high-resolution images with no lexicon. ICDAR 2017 MLT (MLT-2017) (Nayef et al. 2017): comprises 9000 training and 9000 test images; cropped word instances for recognition are obtained using the quadrilateral word-level annotations (see the cropping sketch after this table). SynthText (Gupta, Vedaldi, and Zisserman 2016): consists of 80k training images, from which about 7 million instances with character- and word-level bounding-box annotations are cropped. Synth90k (Jaderberg et al. 2014b): contains 8 million word images rendered from 90k English words, with word-level annotations.
Dataset Splits | No | The paper mentions training and test sets (e.g., 1000 training and 500 test images for IC15; 9000 training and 9000 test images for MLT-2017), but it does not explicitly describe a dedicated validation split (percentages, counts, or a separate validation set) used for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or cloud instance specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | The learning rate is initialized at 10^-3 and then decays to 10^-4 and 10^-5. During training and inference, input images are resized to 64 × 256. The Adam optimizer is used for all experiments. The score threshold ζ_score is set to 0.3 empirically, and the max size N is set to 32 (see the configuration sketch below).
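
For concreteness, below is a minimal PyTorch-style sketch of the setup reported in the Experiment Setup row: Adam optimization with a learning rate initialized at 1e-3 and stepped down to 1e-4 and 1e-5, inputs resized to 64 × 256, ζ_score = 0.3, and N = 32. The model stub, the decay milestones, and the transform pipeline are assumptions; the paper does not specify them.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Input preprocessing: the paper resizes images to 64 x 256 (height x width)
# for both training and inference.
preprocess = transforms.Compose([
    transforms.Resize((64, 256)),
    transforms.ToTensor(),
])

# Placeholder for the TextScanner network; no code is released, so any
# architecture here is an assumption, not the authors' model.
model = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU())

# Adam optimizer with the reported initial learning rate of 1e-3.
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# The rate decays to 1e-4 and then 1e-5; the milestones (epochs 4 and 8
# here) are NOT given in the paper and are assumptions.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[4, 8], gamma=0.1)

# Decoding hyperparameters reported in the paper.
ZETA_SCORE = 0.3   # score threshold for accepting character candidates
MAX_SIZE_N = 32    # maximum sequence length N
```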
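
The Open Datasets row notes that word instances are cropped from MLT-2017 using quadrilateral word-level annotations. The following is a minimal sketch of one common way to perform such cropping, rectifying each annotated quadrilateral with a perspective transform in OpenCV; the function name, corner ordering, and default output size are illustrative assumptions, not details from the paper.

```python
import cv2
import numpy as np

def crop_quadrilateral(image: np.ndarray, quad: np.ndarray,
                       out_w: int = 256, out_h: int = 64) -> np.ndarray:
    """Rectify a quadrilateral word region into an axis-aligned crop.

    `quad` is a (4, 2) array of corner points ordered top-left,
    top-right, bottom-right, bottom-left. The default output size
    matches the 64 x 256 input size reported above.
    """
    src = quad.astype(np.float32)
    dst = np.array([[0, 0], [out_w - 1, 0],
                    [out_w - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    # The perspective transform maps the annotated quadrilateral
    # onto an axis-aligned rectangle.
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (out_w, out_h))
```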