Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing
Authors: Yadong Qu, Yuxin Wang, Bangbang Zhou, Zixiao Wang, Hongtao Xie, Yongdong Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiment results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at https://github.com/qqqyd/ViSu. |
| Researcher Affiliation | Academia | Yadong Qu, Yuxin Wang, Bangbang Zhou, Zixiao Wang, Hongtao Xie, Yongdong Zhang; University of Science and Technology of China, Hefei, China; {qqqyd, bangzhou01, wzx99}@mail.ustc.edu.cn; {wangyx58, htxie, zhyd73}@ustc.edu.cn |
| Pseudocode | No | The paper includes mathematical formulations of loss functions but does not present any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | Code will be available at https://github.com/qqqyd/ViSu. |
| Open Datasets | Yes | SL includes two widely used synthetic datasets MJSynth [14] and SynthText [12], which contain 9M and 7M synthetic images. For real data without annotations, we adopt Union14M-U [15] with a total of 10M refined images from Book32 [13], CC [32], and Open Images [18]. |
| Dataset Splits | No | The paper lists benchmark datasets used for evaluation (which typically include test sets) but does not explicitly describe specific train/validation/test splits, provide percentages, or mention a dedicated validation set split. |
| Hardware Specification | Yes | ViSu is trained on 4 NVIDIA RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer' but does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | All images are resized to 100×32, and the patch size is 8×4. The maximum length T is set to 25. The character set size is 36, including 10 digits and 26 letters. For training settings, the network is trained in an end-to-end manner without pre-training. We adopt the AdamW optimizer and a one-cycle [35] learning rate scheduler with a maximum learning rate of 6e-4. The batch size is 384 for both synthetic data and real unlabeled data. We set the EMA smoothing factor α = 0.999, aspect ratio threshold r = 1.3, confidence thresholds η_ccr = 0.5 and η_cua = 0.7, and temperature factor τ = 0.1. |
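
The Experiment Setup row translates fairly directly into standard PyTorch calls. The sketch below is a hedged reconstruction assembled only from the hyperparameters quoted above, not the authors' released implementation: the `PlaceholderRecognizer` module, the `ema_update` helper, and the total step count are assumptions, since the official ViSu code had not been published at the time of this report.

```python
# Hedged sketch of the reported training configuration using standard PyTorch APIs.
# The recognizer is a placeholder; only the hyperparameters come from the paper excerpt.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

BATCH_SIZE = 384        # for both synthetic and real unlabeled data
MAX_LR = 6e-4           # one-cycle maximum learning rate
EMA_ALPHA = 0.999       # teacher EMA smoothing factor
TOTAL_STEPS = 100_000   # assumption: total iterations are not stated in the excerpt


class PlaceholderRecognizer(nn.Module):
    """Stand-in for the recognizer: 100x32 input, 8x4 patches, T = 25, 36 classes."""

    def __init__(self, charset_size: int = 36, max_len: int = 25):
        super().__init__()
        # Non-overlapping 8x4 (W x H) patches via a strided convolution.
        self.backbone = nn.Conv2d(3, 64, kernel_size=(4, 8), stride=(4, 8))
        self.head = nn.Linear(64, charset_size)
        self.max_len = max_len

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 32, 100) -> patch tokens (B, N, 64) -> logits (B, T, 36)
        feats = self.backbone(images).flatten(2).transpose(1, 2)
        return self.head(feats[:, : self.max_len])


student = PlaceholderRecognizer()
teacher = PlaceholderRecognizer()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is updated only through EMA, never by the optimizer

optimizer = AdamW(student.parameters(), lr=MAX_LR)
scheduler = OneCycleLR(optimizer, max_lr=MAX_LR, total_steps=TOTAL_STEPS)


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = EMA_ALPHA) -> None:
    """teacher <- alpha * teacher + (1 - alpha) * student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```

In a teacher-student semi-supervised loop of this kind, `ema_update(teacher, student)` and `scheduler.step()` would typically be called once per optimizer step; how ViSu combines the supervised and unsupervised losses at each step is described in the paper itself.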