Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition
Authors: Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang, Bo Du (pp. 888-896)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results indicate that S-GTR sets a new state-of-the-art on six challenging STR benchmarks, covering both regular and irregular text recognition, and generalizes well to multi-lingual data, showing superiority on both English and Chinese text. |
| Researcher Affiliation | Collaboration | 1 National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China 2 School of Computer Science, Faculty of Engineering, The University of Sydney, Australia 3 School of Printing and Packaging, and Institute of Artificial Intelligence, Wuhan University, China 4 JD Explore Academy, China |
| Pseudocode | No | The paper describes the methodology in prose and diagrams (Figure 2, Figure 3) but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/adeline-cs/GTR. |
| Open Datasets | Yes | Following (Yu et al. 2020), we use two public synthetic datasets, i.e., SynthText (ST) (Gupta, Vedaldi, and Zisserman 2016) and MJSynth (MJ) (Jaderberg et al. 2014b, 2016), and a real dataset (R) (Baek, Matsui, and Aizawa 2021) for training. |
| Dataset Splits | No | The paper does not explicitly provide the specific training/test/validation dataset splits (e.g., percentages, sample counts, or citations to predefined validation splits) within the main text. |
| Hardware Specification | Yes | The total batch size is 256, equally distributed on four NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like "ADAM optimizer" and "FCN" but does not provide specific version numbers for any key software libraries or frameworks used. |
| Experiment Setup | Yes | We train the model with the ADAM optimizer on the two synthetic datasets for 6 epochs and then fine-tune on the real dataset for another 2 epochs. The total batch size is 256, equally distributed over four NVIDIA V100 GPUs. For the pre-training stage on the synthetic datasets, the learning rate is set to 0.001 and divided by 10 at the 4th and 5th epochs. ... Our model recognizes 63 character classes, including 0-9, a-z, and A-Z. The max decoding length of the output sequence T is set to 25. We follow the standard image pre-processing: randomly resize the width of the original images to one of 4 scales, i.e., 64, 128, 192, and 256, and then pad the images to a resolution of 64 x 256. We adopt multiple data augmentation strategies, including random rotation, perspective distortion, motion blur, and additive Gaussian noise. |
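
The learning-rate schedule and input sizing reported in the experiment setup can be sketched as below. This is a minimal illustration, not the authors' released GTR code: the function names are ours, the epoch counter is assumed 1-indexed, and "divided by 10 at the 4th and 5th epochs" is read as two successive decays.

```python
import random

# Constants quoted from the paper's experiment setup.
BASE_LR = 1e-3                      # pre-training learning rate on synthetic data
WIDTH_SCALES = [64, 128, 192, 256]  # random resize targets for image width
TARGET_H, TARGET_W = 64, 256        # final padded resolution
MAX_DECODE_LEN = 25                 # max output sequence length T

def lr_at_epoch(epoch: int) -> float:
    """Pre-training LR: 0.001, divided by 10 at the 4th and 5th epochs.

    Interpretation (an assumption): epochs 1-3 use the base rate, epoch 4
    uses base/10, and epochs 5-6 use base/100.
    """
    if epoch < 4:
        return BASE_LR
    if epoch < 5:
        return BASE_LR / 10
    return BASE_LR / 100

def sample_input_size(rng=random) -> tuple[int, int]:
    """Pick one of the 4 width scales; images are then padded to 64 x 256."""
    width = rng.choice(WIDTH_SCALES)
    return TARGET_H, width  # (height, resized width) before padding to TARGET_W
```

For example, `lr_at_epoch(1)` returns 0.001 while `lr_at_epoch(6)` returns 0.00001, matching the two-step decay before the 2-epoch fine-tuning stage on real data.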