Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition
Authors: Hao Liu, Bin Wang, Zhimin Bao, Mobai Xue, Sheng Kang, Deqiang Jiang, Yinsong Liu, Bo Ren
AAAI 2022, pp. 1702-1710
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in un- and semi-supervised learning settings on STR benchmarks demonstrate our proposed framework can yield a more robust representation for both CTC-based and attention-based decoders than other contrastive learning methods. To fully investigate the potential of our method, we also collect a dataset of 100 million unlabeled text images, named UTI-100M, covering 5 scenes and 4 languages. By leveraging hundred-million-level unlabeled data, our PerSec shows significant performance improvement when fine-tuning the learned representation on the labeled data. |
| Researcher Affiliation | Collaboration | ¹Tencent YouTu Lab, ²University of Science and Technology of China. {ivanhliu, bingolwang, zhiminbao, dqiangjiang, jasonysliu, timren}@tencent.com, {xmb15, ksc}@mail.ustc.edu.cn |
| Pseudocode | No | The paper includes architectural diagrams (Fig. 2, Fig. 3) but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In this work, we adopt STR public datasets to evaluate the performance of the pre-trained model. The datasets cover three categories: 1) regular scene-text datasets including IC13 (Karatzas et al. 2013), IIIT5K (Mishra, Alahari, and Jawahar 2012) and SVT (Wang, Babenko, and Belongie 2011); 2) irregular scene-text datasets: IC15 (Karatzas et al. 2015), SVTP (Phan et al. 2013) and CT80 (Risnumawan et al. 2014); 3) handwritten text datasets: IAM (Marti and Bunke 2002) and CVL (Kleber et al. 2013). For regular and irregular scene-text recognition, we exploit the synthetic datasets ST (Gupta, Vedaldi, and Zisserman 2016) and MJ (Jaderberg et al. 2014) as training sets. |
| Dataset Splits | No | The paper mentions pre-training and fine-tuning on labeled data but does not specify the train/validation/test splits (e.g., percentages or sample counts) for these datasets. |
| Hardware Specification | Yes | All experiments are conducted on a total of 32 NVIDIA A100 GPUs with 80 GB RAM each. |
| Software Dependencies | No | The proposed self-supervised learning framework is implemented by PyTorch (Paszke et al. 2019). No specific version number for PyTorch or other libraries is provided. |
| Experiment Setup | Yes | In the pre-training stage, we normalize the input images to 32 × 384. In both stroke and semantic context perceivers, the head number of W-MHSA is set to 8, while the dimensions of all linear layers are set to 128. As for the window size ω in W-MHSA, we empirically set it to 1/2 of the input feature height for the stroke context perceiver and to 10 for the semantic context perceiver. In the context perceivers, we set the mask proportions p_low to 0.2 and p_high to 0.15 at the stroke and semantic levels, respectively. Correspondingly, the size of the low-level stroke feature mask m_low and the size of the high-level one m_high are both set to 1. For the quantizers at the stroke and semantic levels, there are 2 codebooks with 256 entries in each. In Eqn. (4), the loss weight parameter α is set to 0.2, while β is set to 0.1. ... And the training batch size is set to 2,048. ... We use the Adam (Kingma and Ba 2014) optimizer and a warm-up strategy with 1e-4 as the initial learning rate. ... we scale the backpropagated gradient at the stroke context perceiver by 0.2 to stabilize the model training. Fine-tuning: ... all input images are normalized to 32 × 128 ... We use the SGD optimizer with a 5e-3 initial learning rate. The training batch size is 2,048. |
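
For quick reference, the reported hyperparameters can be gathered into a single configuration sketch. The snippet below is a minimal, hypothetical PyTorch-style summary assuming only the values quoted above; the dictionary names and the `scale_gradient` helper are illustrative (the authors' code is not released), and the helper shows just one common way to realize the 0.2 gradient scaling at the stroke context perceiver.

```python
import torch

# Hypothetical summary of the reported setup; key names are illustrative.
PRETRAIN_CFG = {
    "input_size": (32, 384),       # H x W of normalized pre-training inputs
    "wmhsa_heads": 8,              # W-MHSA heads in both context perceivers
    "linear_dim": 128,             # dimension of all linear layers
    "window_semantic": 10,         # omega for the semantic context perceiver
    # omega for the stroke perceiver is 1/2 of the input feature height
    "p_low": 0.2,                  # stroke-level mask proportion
    "p_high": 0.15,                # semantic-level mask proportion
    "m_low": 1,                    # stroke-level mask size
    "m_high": 1,                   # semantic-level mask size
    "codebooks_per_quantizer": 2,  # at each of the stroke and semantic levels
    "entries_per_codebook": 256,
    "loss_alpha": 0.2,             # alpha in Eqn. (4)
    "loss_beta": 0.1,              # beta in Eqn. (4)
    "batch_size": 2048,
    "optimizer": "Adam",           # with warm-up
    "init_lr": 1e-4,
    "stroke_grad_scale": 0.2,      # gradient scaling at the stroke perceiver
}

FINETUNE_CFG = {
    "input_size": (32, 128),
    "optimizer": "SGD",
    "init_lr": 5e-3,
    "batch_size": 2048,
}

def scale_gradient(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Identity in the forward pass; multiplies the backpropagated
    gradient by `scale`. One common way to realize the reported 0.2
    gradient scaling at the stroke context perceiver (assumption:
    the authors' actual implementation is not published)."""
    return x * scale + x.detach() * (1.0 - scale)
```

The `scale_gradient` trick works because the forward value is unchanged (`x * s + x * (1 - s) = x`), while only the first term carries gradient, so the backward signal is multiplied by `s`.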