Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition
Authors: Hao Liu, Bin Wang, Zhimin Bao, Mobai Xue, Sheng Kang, Deqiang Jiang, Yinsong Liu, Bo Ren
AAAI 2022, pp. 1702-1710
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in un- and semi-supervised learning settings on STR benchmarks demonstrate our proposed framework can yield a more robust representation for both CTC-based and attention-based decoders than other contrastive learning methods. To fully investigate the potential of our method, we also collect a dataset of 100 million unlabeled text images, named UTI-100M, covering 5 scenes and 4 languages. By leveraging hundred-million-level unlabeled data, our PerSec shows significant performance improvement when fine-tuning the learned representation on the labeled data. |
| Researcher Affiliation | Collaboration | ¹Tencent YouTu Lab, ²University of Science and Technology of China. {ivanhliu, bingolwang, zhiminbao, dqiangjiang, jasonysliu, timren}@tencent.com, {xmb15, ksc}@mail.ustc.edu.cn |
| Pseudocode | No | The paper includes architectural diagrams (Fig. 2, Fig. 3) but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In this work, we adopt STR public datasets to evaluate the performance of the pre-trained model. The datasets cover three categories: 1) regular scene-text datasets including IC13 (Karatzas et al. 2013), IIIT5K (Mishra, Alahari, and Jawahar 2012) and SVT (Wang, Babenko, and Belongie 2011); 2) irregular scene-text datasets: IC15 (Karatzas et al. 2015), SVTP (Phan et al. 2013) and CT80 (Risnumawan et al. 2014); 3) handwritten text datasets: IAM (Marti and Bunke 2002) and CVL (Kleber et al. 2013). For regular and irregular scene-text recognition, we exploit the synthetic datasets ST (Gupta, Vedaldi, and Zisserman 2016) and MJ (Jaderberg et al. 2014) as training sets. |
| Dataset Splits | No | The paper mentions pre-training and fine-tuning on labeled data but does not specify the train/validation/test splits (e.g., percentages or sample counts) for these datasets. |
| Hardware Specification | Yes | All experiments are conducted on a total of 32 NVIDIA A100 GPUs with 80 GB RAM each. |
| Software Dependencies | No | The proposed self-supervised learning framework is implemented by PyTorch (Paszke et al. 2019). No specific version number for PyTorch or other libraries is provided. |
| Experiment Setup | Yes | In the pre-training stage, we normalize the input images to 32 × 384. In both stroke and semantic context perceivers, the head number of W-MHSA is set to 8, while the dimensions of all linear layers are set to 128. As for the window size ω in W-MHSA, we empirically set it to 1/2 of the input feature height for the stroke context perceiver and to 10 for the semantic context perceiver. In the context perceivers, we set the mask proportions p_low to 0.2 and p_high to 0.15 at the stroke and semantic levels, respectively. Correspondingly, the size of the low-level stroke feature mask m_low and the size of the high-level one m_high are both set to 1. For the quantizers at the stroke and semantic levels, there are 2 codebooks with 256 entries in each. In Eqn. (4), the loss weight parameter α is set to 0.2, while β is set to 0.1. ... And the training batch size is set to 2,048. ... We use the Adam (Kingma and Ba 2014) optimizer and a warm-up strategy with 1e-4 as the initial learning rate. ... we scale the backpropagated gradient at the stroke context perceiver by 0.2 to stabilize the model training. Fine-tuning: ... all input images are normalized to 32 × 128 ... We use the SGD optimizer with a 5e-3 initial learning rate. The training batch size is 2,048. |
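
For quick reference, the reported hyperparameters can be gathered into a single configuration sketch. The snippet below is a minimal, hypothetical PyTorch-style summary assuming only the values quoted above; the dictionary names and the `scale_gradient` helper are illustrative (the authors' code is not released), and the helper shows just one common way to realize the 0.2 gradient scaling at the stroke context perceiver.

```python
import torch

# Hypothetical summary of the reported setup; key names are illustrative.
PRETRAIN_CFG = {
    "input_size": (32, 384),       # H x W of normalized pre-training inputs
    "wmhsa_heads": 8,              # W-MHSA heads in both context perceivers
    "linear_dim": 128,             # dimension of all linear layers
    "window_semantic": 10,         # omega for the semantic context perceiver
    # omega for the stroke perceiver is 1/2 of the input feature height
    "p_low": 0.2,                  # stroke-level mask proportion
    "p_high": 0.15,                # semantic-level mask proportion
    "m_low": 1,                    # stroke-level mask size
    "m_high": 1,                   # semantic-level mask size
    "codebooks_per_quantizer": 2,  # at each of the stroke and semantic levels
    "entries_per_codebook": 256,
    "loss_alpha": 0.2,             # alpha in Eqn. (4)
    "loss_beta": 0.1,              # beta in Eqn. (4)
    "batch_size": 2048,
    "optimizer": "Adam",           # with warm-up
    "init_lr": 1e-4,
    "stroke_grad_scale": 0.2,      # gradient scaling at the stroke perceiver
}

FINETUNE_CFG = {
    "input_size": (32, 128),
    "optimizer": "SGD",
    "init_lr": 5e-3,
    "batch_size": 2048,
}

def scale_gradient(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Identity in the forward pass; multiplies the backpropagated
    gradient by `scale`. One common way to realize the reported 0.2
    gradient scaling at the stroke context perceiver (assumption:
    the authors' actual implementation is not published)."""
    return x * scale + x.detach() * (1.0 - scale)
```

The `scale_gradient` trick works because the forward value is unchanged (`x * s + x * (1 - s) = x`), while only the first term carries gradient, so the backward signal is multiplied by `s`.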