TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Authors: Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
Researcher Affiliation | Collaboration | ¹Beihang University, ²Microsoft Corporation
Pseudocode | No | The paper describes the model architecture and pipeline in text and diagrams (Figure 1) but does not provide pseudocode or algorithm blocks. (A hedged pseudocode sketch of the described pipeline is given after the table.)
Open Source Code | Yes | The TrOCR models and code are publicly available at https://aka.ms/trocr. (A minimal inference sketch follows the table.)
Open Datasets | Yes | To build a large-scale high-quality dataset, we sample two million document pages from the publicly available PDF files on the Internet. ... We use 5,427 handwritten fonts to synthesize handwritten textline images by TRDG, an open-source text recognition data generator. The text used for generation is crawled from random pages of Wikipedia. ... The second-stage pre-training data for the scene text recognition are MJSynth (MJ) (Jaderberg et al. 2014) and SynthText (ST) (Gupta, Vedaldi, and Zisserman 2016), totaling about 16M text images. (A data-generation sketch follows the table.)
Dataset Splits | Yes | The IAM Handwriting Database is composed of handwritten English text, which is the most popular dataset for handwritten text recognition. We use Aachen's partition of the dataset: 6,161 lines from 747 forms in the train set, 966 lines from 115 forms in the validation set and 2,915 lines from 336 forms in the test set.
Hardware Specification | Yes | We use 32 V100 GPUs with 32 GB of memory for pre-training and 8 V100 GPUs for fine-tuning.
Software Dependencies | No | The paper mentions software such as Fairseq, the timm library, UniLM, and PyTorch, but does not specify version numbers, which are necessary for reproducible software dependencies. (An illustrative requirements skeleton follows the table.)
Experiment Setup | Yes | For all the models, the batch size is set to 2,048 and the learning rate is 5e-5. ... We employ the 384 × 384 resolution and 16 × 16 patch size for DeiT and BEiT encoders. The DeiT_SMALL has 12 layers with 384 hidden sizes and 6 heads. Both the DeiT_BASE and the BEiT_BASE have 12 layers with 768 hidden sizes and 12 heads, while the BEiT_LARGE has 24 layers with 1,024 hidden sizes and 16 heads. We use 6 layers, 256 hidden sizes and 8 attention heads for the small decoders, 512 hidden sizes for the base decoders, and 12 layers, 1,024 hidden sizes and 16 heads for the large decoders. For this task, we only use the last half of all layers from the corresponding RoBERTa model, which are the last 6 layers for RoBERTa_BASE and the last 12 layers for RoBERTa_LARGE. The beam size is set to 10 for TrOCR models. (The reported settings are collected in the configuration sketch after the table.)
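
Since the paper itself provides no pseudocode, the pipeline it describes in prose and Figure 1 can be summarized in a short sketch. Every helper name below is hypothetical and stands in for a component the paper only describes in text; greedy decoding is shown in place of the paper's beam search for brevity:

```python
# Pseudocode-style sketch of the TrOCR pipeline as described in the
# paper (prose + Figure 1); every helper below is hypothetical.

def trocr_recognize(image, encoder, decoder, tokenizer, max_len=128):
    # 1. Resize the text-line image and split it into fixed-size patches
    #    (the paper uses 384x384 inputs with 16x16 patches).
    patches = split_into_patches(resize(image, (384, 384)), patch_size=16)

    # 2. Embed the flattened patches and run the Transformer encoder
    #    (a DeiT/BEiT-style ViT; no CNN backbone).
    memory = encoder(embed_patches(patches))

    # 3. Decode wordpiece tokens autoregressively, cross-attending to
    #    the encoder output (the paper uses beam size 10; greedy
    #    decoding is shown here for simplicity).
    tokens = [tokenizer.bos_id]
    for _ in range(max_len):
        next_token = argmax(decoder(tokens, memory))
        if next_token == tokenizer.eos_id:
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens[1:])
```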
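
The released checkpoints are also distributed through Hugging Face transformers. A minimal inference sketch, assuming the microsoft/trocr-base-handwritten checkpoint and a single text-line crop saved as line.png:

```python
# Minimal inference with the released TrOCR checkpoints as mirrored on
# Hugging Face; model ID and image path are assumptions for this sketch.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line.png").convert("RGB")  # a single text-line crop
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The paper sets the beam size to 10 for TrOCR models.
generated_ids = model.generate(pixel_values, num_beams=10, max_length=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```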
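
The handwritten pre-training data is synthesized with TRDG, an open-source generator. A small sketch of that recipe using the trdg package follows; the sentences and font paths are placeholders standing in for the crawled Wikipedia text and the 5,427 handwritten fonts the paper uses:

```python
# Sketch of the handwritten text-line synthesis described in the paper,
# using the open-source TRDG generator (pip install trdg).
from trdg.generators import GeneratorFromStrings

sentences = [
    "a placeholder sentence standing in for crawled Wikipedia text",
    "another placeholder text line",
]
# Hypothetical font paths; the paper uses 5,427 handwritten fonts.
fonts = ["fonts/handwritten_0001.ttf", "fonts/handwritten_0002.ttf"]

generator = GeneratorFromStrings(sentences, fonts=fonts, count=4, size=64)
for i, (image, label) in enumerate(generator):
    image.save(f"synth_line_{i}.png")   # rendered text-line image
    print(label)                        # ground-truth transcription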
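
Because the paper names its software stack without versions, a reproduction has to pin them independently. An illustrative requirements skeleton, asserting no version numbers since the paper gives none:

```text
# Illustrative requirements skeleton; the paper names these packages but
# specifies no versions, so exact pins must be taken from the released
# repository (https://aka.ms/trocr) or chosen by the reproducer.
torch     # PyTorch
fairseq
timm
# UniLM is consumed as a codebase (https://github.com/microsoft/unilm),
# not as a pip-installable package.
```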
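
The hyperparameters quoted in the Experiment Setup row can be collected in one place. The dictionaries below only transcribe the reported numbers; the layout and key names are ours, and the values the text leaves unstated are marked as assumptions:

```python
# Consolidated TrOCR hyperparameters, transcribed from the paper's
# experiment-setup text; dictionary layout and key names are ours.
TRAINING = {"batch_size": 2048, "learning_rate": 5e-5, "beam_size": 10}

# Encoders: all use 384x384 input resolution and 16x16 patches.
ENCODERS = {
    "DeiT_SMALL": {"layers": 12, "hidden": 384,  "heads": 6},
    "DeiT_BASE":  {"layers": 12, "hidden": 768,  "heads": 12},
    "BEiT_BASE":  {"layers": 12, "hidden": 768,  "heads": 12},
    "BEiT_LARGE": {"layers": 24, "hidden": 1024, "heads": 16},
}

# Decoders: RoBERTa-initialized decoders keep only the last half of the
# layers (last 6 of RoBERTa_BASE, last 12 of RoBERTa_LARGE).
DECODERS = {
    "small": {"layers": 6, "hidden": 256, "heads": 8},
    # The paper states only the 512 hidden size for the base decoder;
    # layers and heads here are assumed to match the small decoder.
    "base":  {"layers": 6, "hidden": 512, "heads": 8},
    "large": {"layers": 12, "hidden": 1024, "heads": 16},
}
```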