SVTR: Scene Text Recognition with a Single Visual Model
Authors: Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, Yu-Gang Jiang
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR. |
| Researcher Affiliation | Collaboration | 1School of Computer and Information Technology and Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China 3Baidu Inc., China {yongkundu, cyjia}@bjtu.edu.cn, {zhinchen, ygj}@fudan.edu.cn, tlzheng21@m.fudan.edu.cn, {yinxiaoting, lichenxia, duyuning}@baidu.com |
| Pseudocode | No | The paper describes the overall architecture and components but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR. |
| Open Datasets | Yes | For the English recognition task, the models are trained on two commonly used synthetic scene text datasets, i.e., MJSynth (MJ) [Jaderberg et al., 2014; Jaderberg et al., 2015] and SynthText (ST) [Gupta et al., 2016]. For the Chinese recognition task, they use the Chinese Scene Dataset [Chen et al., 2021], a public dataset containing 509,164 training, 63,645 validation, and 63,646 test images. |
| Dataset Splits | Yes | For the Chinese recognition task, they use the Chinese Scene Dataset [Chen et al., 2021], a public dataset containing 509,164 training, 63,645 validation, and 63,646 test images. The validation set is utilized to determine the best model, which is then assessed using the test set. |
| Hardware Specification | Yes | All models are trained using 4 Tesla V100 GPUs on PaddlePaddle. SVTR-T is effective yet efficient, with 6.03M parameters and an average inference time of 4.5 ms per text image on one NVIDIA 1080Ti GPU. |
| Software Dependencies | No | The paper mentions PaddlePaddle as the framework used, but does not provide specific version numbers for it or any other key software dependencies. |
| Experiment Setup | Yes | The AdamW optimizer with a weight decay of 0.05 is used for training. For English models, the initial learning rate is set to 5×10⁻⁴ × batchsize / 2048. A cosine learning rate scheduler with 2 epochs of linear warm-up is used across all 21 epochs. Data augmentations such as rotation, perspective distortion, motion blur, and Gaussian noise are randomly applied during training. The alphabet includes all case-insensitive alphanumerics. The maximum prediction length is set to 25. For Chinese models, the initial learning rate is set to 3×10⁻⁴ × batchsize / 512, with a cosine learning rate scheduler and 5 epochs of linear warm-up across all 100 epochs. |
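The reported schedule (batch-size-scaled initial learning rate, linear warm-up, then cosine decay) can be sketched as a small per-epoch function. This is an illustrative reconstruction, not code from the paper or PaddleOCR; the function and parameter names are ours, and the defaults correspond to the English-model setting.

```python
import math

def svtr_lr(epoch, base_lr=5e-4, batch_size=2048, ref_batch=2048,
            warmup_epochs=2, total_epochs=21):
    """Per-epoch learning rate: base_lr scaled by batch_size / ref_batch,
    linear warm-up for the first warmup_epochs, then cosine decay to 0
    over the remaining epochs. A sketch of the paper's stated schedule,
    not the authors' actual implementation."""
    peak = base_lr * batch_size / ref_batch
    if epoch < warmup_epochs:
        # linear warm-up: ramp from peak/warmup_epochs up to peak
        return peak * (epoch + 1) / warmup_epochs
    # cosine decay from peak toward 0 over the post-warm-up epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))
```

The Chinese-model setting would use `base_lr=3e-4`, `ref_batch=512`, `warmup_epochs=5`, `total_epochs=100` under the same scheme.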