Context-Based Contrastive Learning for Scene Text Recognition
Authors: Xinyun Zhang, Binwu Zhu, Xufeng Yao, Qi Sun, Ruiyu Li, Bei Yu
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that ConCLR significantly improves out-of-vocabulary generalization and achieves state-of-the-art performance on public benchmarks together with attention-based recognizers. ... In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed method. |
| Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong 2SmartMore {xyzhang21,bwzhu,xfyao,qsun,byu}@cse.cuhk.edu.hk, royliruiyu@gmail.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | The training set consists of two synthetic datasets, MJ (Jaderberg et al. 2016, 2014) and ST (Gupta, Vedaldi, and Zisserman 2016) |
| Dataset Splits | Yes | The training set consists of two synthetic datasets, MJ (Jaderberg et al. 2016, 2014) and ST (Gupta, Vedaldi, and Zisserman 2016), and evaluation is conducted on six public benchmarks, including ICDAR 2013 (IC13) (Karatzas et al. 2013), ICDAR 2015 (IC15) (Karatzas et al. 2015), IIIT 5K-Words (IIIT) (Mishra, Alahari, and Jawahar 2012), Street View Text (SVT) (Wang, Babenko, and Belongie 2011), Street View Text Perspective (SVTP) (Phan et al. 2013), and CUTE80 (CUTE) (Risnumawan et al. 2014), and our synthesized benchmark OutText. |
| Hardware Specification | Yes | All the experiments are conducted on four NVIDIA 2080Ti GPUs with batch size 384. |
| Software Dependencies | No | The paper mentions software components like ResNet, Transformer, and ADAM optimizer but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use three transformer layers for the parallel attention module, with eight heads for each of them. Images are resized to 32 × 128 with common data augmentation, such as random rotation, affine transformation, and color jittering. We use ADAM as the optimizer, with a learning rate initialized to 1e-4 and decayed to 1e-5 at the 6th epoch. All the experiments are conducted on four NVIDIA 2080Ti GPUs with batch size 384. |
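
The "Experiment Setup" row translates fairly directly into a training configuration. Below is a minimal PyTorch/torchvision sketch of that setup (resize to 32 × 128, rotation/affine/color-jitter augmentation, ADAM with lr 1e-4 decayed to 1e-5 at the 6th epoch, batch size 384). The recognizer model, the augmentation magnitudes, and the total epoch count are placeholders assumed for illustration; the paper does not release code, so this is not the authors' implementation.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Images resized to 32x128 with common augmentations (rotation, affine
# transformation, color jittering), as stated in the paper.
# The specific magnitudes below are assumptions, not values from the paper.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # angle assumed
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # shift assumed
    transforms.ColorJitter(0.5, 0.5, 0.5, 0.1),                 # strengths assumed
    transforms.Resize((32, 128)),
    transforms.ToTensor(),
])

# Placeholder recognizer standing in for the ResNet + transformer model.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 37),
)

# ADAM optimizer, lr initialized to 1e-4 and decayed to 1e-5 at the 6th epoch.
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6], gamma=0.1)

batch_size = 384  # split across four 2080Ti GPUs in the paper

num_epochs = 10   # total epoch count not reported; assumed here
for epoch in range(num_epochs):
    # ... training loop over (image, label) batches would go here ...
    scheduler.step()
```

The only hyperparameters taken verbatim from the paper are the image size, the optimizer choice, the learning-rate schedule, and the batch size; everything else is scaffolding so the snippet runs end to end.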