UPOCR: Towards Unified Pixel-Level OCR Interface

Authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models.
Researcher Affiliation | Collaboration | South China University of Technology; INTSIG-SCUT Joint Lab of Document Image Analysis and Recognition; INTSIG Information Co., Ltd. Correspondence to: Lianwen Jin <eelwjin@scut.edu.cn>.
Pseudocode | No | The paper describes the methodology in prose and figures (Table 7 for network architecture details) but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Code is available at https://github.com/shannanyinxiang/UPOCR.
Open Datasets | Yes | The SCUT-EnsText (Liu et al., 2020), TextSeg (Xu et al., 2021), and Tampered-IC13 (Wang et al., 2022b) datasets are employed for these three tasks, respectively.
Dataset Splits | Yes | TextSeg is a large-scale fine-annotated text segmentation dataset with 4,024 images of scene text and design text. The training, validation, and testing sets contain 2,646, 340, and 1,038 samples, respectively.
Hardware Specification | Yes | The training lasts approximately 36 hours using two NVIDIA A100 GPUs with 80GB memory.
Software Dependencies | No | The proposed UPOCR is implemented with PyTorch (footnote points to https://pytorch.org/). The paper mentions PyTorch but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | The input size is set to 512 × 512. The model is optimized for 80,000 iterations with a batch size of 48 using an AdamW (Loshchilov & Hutter, 2019) optimizer in a multi-task fashion. The learning rate is initialized as 0.0005 and linearly decays per 200 iterations, finally reaching 0.00001 at the last 200 iterations.
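To make the Experiment Setup row concrete, below is a minimal PyTorch sketch of the reported optimization schedule (AdamW, batch size 48, 512 × 512 inputs, 80,000 iterations, learning rate decaying linearly from 0.0005 to 0.00001 in steps of 200 iterations). The model, data, and loss are hypothetical placeholders; only the hyperparameters come from the paper, and the authors' exact decay formula may differ.

```python
# Sketch of the reported training schedule; model/data/loss are stand-ins.
import torch

TOTAL_ITERS = 80_000
STEP_EVERY = 200            # learning rate is updated once per 200 iterations
LR_INIT, LR_FINAL = 5e-4, 1e-5

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the UPOCR network
optimizer = torch.optim.AdamW(model.parameters(), lr=LR_INIT)

def lr_multiplier(sched_step):
    # Linear decay from LR_INIT to LR_FINAL over the 400 scheduler steps
    # (80,000 / 200); clamp so the last 200 iterations run at LR_FINAL.
    total_steps = TOTAL_ITERS // STEP_EVERY
    frac = min(sched_step / (total_steps - 1), 1.0)
    lr = LR_INIT + (LR_FINAL - LR_INIT) * frac
    return lr / LR_INIT                        # LambdaLR expects a multiplier

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)

for it in range(TOTAL_ITERS):
    x = torch.rand(48, 3, 512, 512)            # batch size 48, 512 x 512 inputs
    loss = model(x).mean()                     # placeholder multi-task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (it + 1) % STEP_EVERY == 0:
        scheduler.step()
```

With this schedule, the learning rate stays at 0.0005 for the first 200 iterations and reaches 0.00001 for the final 200, matching the figures quoted in the row above.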