UPOCR: Towards Unified Pixel-Level OCR Interface
Authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. |
| Researcher Affiliation | Collaboration | ¹South China University of Technology, ²INTSIG-SCUT Joint Lab of Document Image Analysis and Recognition, ³INTSIG Information Co., Ltd. Correspondence to: Lianwen Jin <eelwjin@scut.edu.cn>. |
| Pseudocode | No | The paper describes the methodology in prose, figures, and tables (Table 7 gives network architecture details) but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/shannanyinxiang/UPOCR. |
| Open Datasets | Yes | The SCUT-EnsText (Liu et al., 2020), TextSeg (Xu et al., 2021), and Tampered-IC13 (Wang et al., 2022b) datasets are employed for these three tasks, respectively. |
| Dataset Splits | Yes | TextSeg is a large-scale, finely annotated text segmentation dataset with 4,024 images of scene text and design text. The training, validation, and testing sets contain 2,646, 340, and 1,038 samples, respectively. |
| Hardware Specification | Yes | The training lasts approximately 36 hours using two NVIDIA A100 GPUs with 80GB memory. |
| Software Dependencies | No | The proposed UPOCR is implemented with PyTorch (a footnote points to https://pytorch.org/). The paper mentions PyTorch but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | The input size is set to 512 × 512. The model is optimized for 80,000 iterations with a batch size of 48 using an AdamW (Loshchilov & Hutter, 2019) optimizer in a multi-task fashion. The learning rate is initialized as 0.0005 and linearly decays every 200 iterations, finally reaching 0.00001 over the last 200 iterations. |
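
For reference, the following is a minimal PyTorch sketch of the reported optimization settings (80,000 iterations, batch size 48, AdamW, learning rate stepped linearly from 0.0005 down to 0.00001 every 200 iterations). The model placeholder and training-loop body are hypothetical stand-ins, not the authors' code; their actual implementation is in the linked repository.

```python
# Sketch of the reported UPOCR training schedule, assuming a standard
# per-iteration LambdaLR scheduler; the model below is a placeholder.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_ITERS = 80_000
DECAY_EVERY = 200
LR_START, LR_END = 5e-4, 1e-5

model = torch.nn.Conv2d(3, 3, 3)  # placeholder, not the UPOCR network
optimizer = AdamW(model.parameters(), lr=LR_START)

def lr_lambda(it: int) -> float:
    # Piecewise-linear decay: the multiplier drops once every 200 iterations,
    # reaching LR_END / LR_START (i.e. lr = 1e-5) for the final 200 iterations.
    num_steps = TOTAL_ITERS // DECAY_EVERY - 1
    step = min(it // DECAY_EVERY, num_steps)
    frac = step / num_steps
    return 1.0 + frac * (LR_END / LR_START - 1.0)

scheduler = LambdaLR(optimizer, lr_lambda)

for it in range(TOTAL_ITERS):
    # ... forward/backward on a batch of 48 multi-task 512x512 samples ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```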