Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Authors: Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks.
Researcher Affiliation | Academia | Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie; University of Science and Technology of China, Hefei, China; {zuangao, qqqyd, cyril, wzx99, xujj1998}@mail.ustc.edu.cn, {wangyx58, htxie}@ustc.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/FaltingsA/SSM.
Open Datasets | Yes | Unlabeled Pre-training Data: We utilize the latest unlabeled real-scene dataset Union14M-U for self-supervised learning, which contains 10 million instances collected from Book32, OCR-CC and OpenVINO. Besides, we also conduct pre-training on the complete OCR-CC dataset (15.77M unlabeled text images) to facilitate a fair comparison with works such as CCD [Guan et al., 2023] and DiG [Yang et al., 2022]. Text Recognition Fine-tuning Data: We use three types of labeled data. 1) STD: the synthetic data, comprising 14M images from MJSynth [Jaderberg et al., 2014] and SynthText [Gupta et al., 2016].
Dataset Splits | No | The paper mentions various training datasets (STD, ARD, Union14M-L) and evaluation benchmarks (e.g., Common benchmarks, Union14M benchmarks), but it does not explicitly describe how the data was split into training, validation, and test sets for their experiments, nor does it specify validation set sizes or percentages.
Hardware Specification | No | The paper mentions 'GPU resource support offered by the MCC Lab of Information Science and Technology Institution, USTC', but does not specify any particular GPU models, CPU types, or other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper mentions various architectural components, optimizers, and loss functions (e.g., 'ViT Encoder', 'AdamW optimizer', 'L2 loss'), but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Self-supervised Pre-training: The pre-training is conducted on ViT with an image resolution of 32×128, an AdamW optimizer, a cosine learning rate scheduler with a learning rate of 5e-4, a batch size of 1,024, a weight decay of 0.05, β1 = 0.9, β2 = 0.95, and warm-up for 1 epoch out of 20 epochs in total. Text Recognition Fine-Tuning: Our text recognition network is fine-tuned on the STD, ARD, or Union14M-L dataset. The patch size is 4×4. The text decoder consists of 6 transformer blocks with an embedding dimension of 384. The batch size is 384 and the warm-up time is 1 epoch. The AdamW optimizer and a OneCycle learning rate scheduler with a learning rate of 1e-4 are employed.
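
The hyperparameters reported in the Experiment Setup row translate almost directly into a training configuration. Below is a minimal PyTorch sketch of the pre-training optimization schedule only (AdamW with lr 5e-4, weight decay 0.05, betas 0.9/0.95, and a cosine learning rate schedule with a 1-epoch warm-up over 20 epochs). The stand-in module and the `STEPS_PER_EPOCH` value are assumptions for illustration; SSM's actual ViT encoder, symmetric superimposition objective, and data pipeline are not reproduced here.

```python
# Hedged sketch of the reported pre-training optimization setup, not SSM's released code.
import math
import torch

STEPS_PER_EPOCH = 1000             # assumption: depends on dataset size and the 1,024 batch size
WARMUP_EPOCHS, TOTAL_EPOCHS = 1, 20

model = torch.nn.Linear(384, 384)  # stand-in module for the ViT encoder used in the paper

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, weight_decay=0.05, betas=(0.9, 0.95)
)

def lr_lambda(step: int) -> float:
    """Linear warm-up for 1 epoch, then cosine decay over the remaining epochs."""
    warmup_steps = WARMUP_EPOCHS * STEPS_PER_EPOCH
    total_steps = TOTAL_EPOCHS * STEPS_PER_EPOCH
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage: call scheduler.step() after each optimizer.step() during pre-training
# so the learning rate follows the warm-up + cosine curve described above.
```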