Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

Authors: Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang, Bo Du

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | S-GTR sets a new state of the art on six challenging STR benchmarks and generalizes well to multilingual datasets: "Extensive experimental results indicate our S-GTR successfully sets new state-of-the-art for regular and irregular text recognition tasks as well as shows a superiority on both English and Chinese text materials."
Researcher Affiliation | Collaboration | 1. National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 2. School of Computer Science, Faculty of Engineering, The University of Sydney, Australia; 3. School of Printing and Packaging, and Institute of Artificial Intelligence, Wuhan University, China; 4. JD Explore Academy, China
Pseudocode | No | The paper describes the methodology in prose and diagrams (Figure 2, Figure 3) but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/adeline-cs/GTR.
Open Datasets | Yes | "Following (Yu et al. 2020), we use two public synthetic datasets, i.e., SynthText (ST) (Gupta, Vedaldi, and Zisserman 2016) and MJSynth (MJ) (Jaderberg et al. 2014b, 2016), and a real dataset (R) (Baek, Matsui, and Aizawa 2021) for training."
Dataset Splits | No | The paper does not explicitly provide the training/validation/test splits (e.g., percentages, sample counts, or citations to predefined validation splits) in the main text.
Hardware Specification | Yes | "The total batch size is 256, equally distributed on four NVIDIA V100 GPUs."
Software Dependencies | No | The paper mentions software components such as the ADAM optimizer and an FCN, but does not provide version numbers for any key software libraries or frameworks.
Experiment Setup | Yes | "We train the model with the ADAM optimizer on two synthetic datasets for 6 epochs and then transfer to the real dataset for another 2 epochs. The total batch size is 256, equally distributed on four NVIDIA V100 GPUs. For the pre-training stage on synthetic datasets, the learning rate is set to 0.001 and divided by 10 at the 4th and 5th epochs. ... Our model recognizes 63 types of characters, including 0-9, a-z, and A-Z. The max decoding length of the output sequence T is set to 25. We follow the standard image pre-processing of randomly resizing the width of original images to one of 4 scales, i.e., 64, 128, 192, and 256, and then padding the images to a resolution of 64 x 256. We adopt multiple data augmentation strategies including random rotation, perspective distortion, motion blur, and adding Gaussian noise to the image."
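The Experiment Setup row is concrete enough to sketch in code. The snippet below is a minimal illustration of the described image pre-processing (randomly resize the width to one of 4 scales, then pad to 64 x 256) and the step learning-rate schedule (0.001, divided by 10 at epochs 4 and 5), not the authors' implementation: the nearest-neighbour interpolation, zero padding value, and milestone convention are assumptions.

```python
import random
import numpy as np

SCALES = (64, 128, 192, 256)   # candidate widths from the paper
TARGET_H, TARGET_W = 64, 256   # final canvas resolution

def preprocess(image: np.ndarray, rng=random) -> np.ndarray:
    """Randomly resize the width to one of four scales, then zero-pad to 64x256.

    Sketch only: nearest-neighbour resize and right-side zero padding are
    assumptions, not details stated in the paper.
    """
    h, w = image.shape[:2]
    new_w = rng.choice(SCALES)
    # nearest-neighbour resize to (TARGET_H, new_w)
    rows = (np.arange(TARGET_H) * h / TARGET_H).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = image[rows][:, cols]
    # pad on the right to the full 64x256 canvas
    padded = np.zeros((TARGET_H, TARGET_W) + image.shape[2:], dtype=image.dtype)
    padded[:, :new_w] = resized
    return padded

def lr_at(epoch: int) -> float:
    """Step decay: 1e-3, divided by 10 at epochs 4 and 5 (milestone convention assumed)."""
    lr = 1e-3
    for milestone in (4, 5):
        if epoch >= milestone:
            lr /= 10
    return lr
```

In a framework such as PyTorch the same schedule would typically be expressed with a multi-step scheduler (milestones [4, 5], factor 0.1) rather than computed by hand.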