Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Authors: Yadong Qu, Yuxin Wang, Bangbang Zhou, Zixiao Wang, Hongtao Xie, Yongdong Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiment results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at https://github.com/qqqyd/ViSu.
Researcher Affiliation | Academia | Yadong Qu, Yuxin Wang, Bangbang Zhou, Zixiao Wang, Hongtao Xie, Yongdong Zhang; University of Science and Technology of China, Hefei, China; {qqqyd, bangzhou01, wzx99}@mail.ustc.edu.cn, {wangyx58, htxie, zhyd73}@ustc.edu.cn
Pseudocode | No | The paper includes mathematical formulations of loss functions but does not present any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | Code will be available at https://github.com/qqqyd/ViSu.
Open Datasets | Yes | SL includes two widely used synthetic datasets, MJSynth [14] and SynthText [12], which contain 9M and 7M synthetic images. For real data without annotations, we adopt Union14M-U [15] with a total of 10M refined images from Book32 [13], CC [32], and Open Images [18].
Dataset Splits | No | The paper lists benchmark datasets used for evaluation (which typically include test sets) but does not explicitly describe train/validation/test splits, provide split percentages, or mention a dedicated validation set.
Hardware Specification | Yes | ViSu is trained on 4 NVIDIA RTX 4090 GPUs.
Software Dependencies | No | The paper mentions software components like the AdamW optimizer but does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | All images are resized to 100×32, and the patch size is 8×4. The maximum length T is set to 25. The character set size is 36, including 10 digits and 26 alphabets. For training settings, the network is trained in an end-to-end manner without pre-training. We adopt the AdamW optimizer and a one-cycle [35] learning rate scheduler with a maximum learning rate of 6e-4. The batch size is 384 for both synthetic data and real unlabeled data. We set the EMA smoothing factor α = 0.999, aspect ratio threshold r = 1.3, confidence thresholds ηccr = 0.5 and ηcua = 0.7, and temperature factor τ = 0.1.
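
The optimizer and scheduler portion of this setup can be expressed as a training configuration. The following is a minimal sketch assuming PyTorch (the paper does not name its framework or versions); the model, total step count, and loss computation are placeholders, while the learning rate, scheduler type, batch size, and EMA factor follow the values quoted above.

```python
# Hedged sketch of the reported optimization setup, assuming PyTorch.
# Model architecture, total_steps, and the training loop are placeholders;
# lr = 6e-4, one-cycle schedule, batch size 384, and EMA alpha = 0.999
# come from the quoted experiment setup.
import copy
import torch

model = torch.nn.Linear(10, 36)      # placeholder for the recognition network
ema_model = copy.deepcopy(model)     # EMA teacher, updated with alpha = 0.999

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
total_steps = 100_000                # placeholder; not reported in this section
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=6e-4, total_steps=total_steps
)

ALPHA = 0.999  # EMA smoothing factor from the paper

def update_ema(student, teacher, alpha=ALPHA):
    """Exponential moving average update of the teacher weights."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)

# Per step (loss computation omitted):
#   loss.backward(); optimizer.step(); scheduler.step()
#   update_ema(model, ema_model)
```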