TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Authors: Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
Researcher Affiliation | Collaboration | ¹Beihang University, ²Microsoft Corporation
Pseudocode | No | The paper describes the model architecture and pipeline in text and diagrams (Figure 1) but does not provide pseudocode or algorithm blocks. (A hedged pseudocode sketch of the described pipeline is given after the table.)
Open Source Code | Yes | The TrOCR models and code are publicly available at https://aka.ms/trocr. (A minimal inference sketch follows the table.)
Open Datasets | Yes | To build a large-scale high-quality dataset, we sample two million document pages from the publicly available PDF files on the Internet. ... We use 5,427 handwritten fonts to synthesize handwritten textline images by TRDG, an open-source text recognition data generator. The text used for generation is crawled from random pages of Wikipedia. ... The second-stage pre-training data for the scene text recognition are MJSynth (MJ) (Jaderberg et al. 2014) and SynthText (ST) (Gupta, Vedaldi, and Zisserman 2016), totaling about 16M text images. (A data-generation sketch follows the table.)
Dataset Splits | Yes | The IAM Handwriting Database is composed of handwritten English text, which is the most popular dataset for handwritten text recognition. We use Aachen's partition of the dataset: 6,161 lines from 747 forms in the train set, 966 lines from 115 forms in the validation set and 2,915 lines from 336 forms in the test set.
Hardware Specification | Yes | We use 32 V100 GPUs with 32 GB of memory for pre-training and 8 V100 GPUs for fine-tuning.
Software Dependencies | No | The paper mentions software such as Fairseq, the timm library, UniLM, and PyTorch, but does not specify version numbers, which are necessary for reproducible software dependencies. (An illustrative requirements skeleton follows the table.)
Experiment Setup | Yes | For all the models, the batch size is set to 2,048 and the learning rate is 5e-5. ... We employ the 384 × 384 resolution and 16 × 16 patch size for DeiT and BEiT encoders. The DeiT_SMALL has 12 layers with 384 hidden sizes and 6 heads. Both the DeiT_BASE and the BEiT_BASE have 12 layers with 768 hidden sizes and 12 heads, while the BEiT_LARGE has 24 layers with 1,024 hidden sizes and 16 heads. We use 6 layers, 256 hidden sizes and 8 attention heads for the small decoders, 512 hidden sizes for the base decoders, and 12 layers, 1,024 hidden sizes and 16 heads for the large decoders. For this task, we only use the last half of all layers from the corresponding RoBERTa model, which are the last 6 layers for RoBERTa_BASE and the last 12 layers for RoBERTa_LARGE. The beam size is set to 10 for TrOCR models. (The reported settings are collected in the configuration sketch after the table.)
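
Since the paper itself provides no pseudocode, the pipeline it describes in prose and Figure 1 can be summarized in a short sketch. Every helper name below is hypothetical and stands in for a component the paper only describes in text; greedy decoding is shown in place of the paper's beam search for brevity:

```python
# Pseudocode-style sketch of the TrOCR pipeline as described in the
# paper (prose + Figure 1); every helper below is hypothetical.

def trocr_recognize(image, encoder, decoder, tokenizer, max_len=128):
    # 1. Resize the text-line image and split it into fixed-size patches
    #    (the paper uses 384x384 inputs with 16x16 patches).
    patches = split_into_patches(resize(image, (384, 384)), patch_size=16)

    # 2. Embed the flattened patches and run the Transformer encoder
    #    (a DeiT/BEiT-style ViT; no CNN backbone).
    memory = encoder(embed_patches(patches))

    # 3. Decode wordpiece tokens autoregressively, cross-attending to
    #    the encoder output (the paper uses beam size 10; greedy
    #    decoding is shown here for simplicity).
    tokens = [tokenizer.bos_id]
    for _ in range(max_len):
        next_token = argmax(decoder(tokens, memory))
        if next_token == tokenizer.eos_id:
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens[1:])
```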
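
The released checkpoints are also distributed through Hugging Face transformers. A minimal inference sketch, assuming the microsoft/trocr-base-handwritten checkpoint and a single text-line crop saved as line.png:

```python
# Minimal inference with the released TrOCR checkpoints as mirrored on
# Hugging Face; model ID and image path are assumptions for this sketch.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line.png").convert("RGB")  # a single text-line crop
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The paper sets the beam size to 10 for TrOCR models.
generated_ids = model.generate(pixel_values, num_beams=10, max_length=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```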
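
The handwritten pre-training data is synthesized with TRDG, an open-source generator. A small sketch of that recipe using the trdg package follows; the sentences and font paths are placeholders standing in for the crawled Wikipedia text and the 5,427 handwritten fonts the paper uses:

```python
# Sketch of the handwritten text-line synthesis described in the paper,
# using the open-source TRDG generator (pip install trdg).
from trdg.generators import GeneratorFromStrings

sentences = [
    "a placeholder sentence standing in for crawled Wikipedia text",
    "another placeholder text line",
]
# Hypothetical font paths; the paper uses 5,427 handwritten fonts.
fonts = ["fonts/handwritten_0001.ttf", "fonts/handwritten_0002.ttf"]

generator = GeneratorFromStrings(sentences, fonts=fonts, count=4, size=64)
for i, (image, label) in enumerate(generator):
    image.save(f"synth_line_{i}.png")   # rendered text-line image
    print(label)                        # ground-truth transcription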
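
Because the paper names its software stack without versions, a reproduction has to pin them independently. An illustrative requirements skeleton, asserting no version numbers since the paper gives none:

```text
# Illustrative requirements skeleton; the paper names these packages but
# specifies no versions, so exact pins must be taken from the released
# repository (https://aka.ms/trocr) or chosen by the reproducer.
torch     # PyTorch
fairseq
timm
# UniLM is consumed as a codebase (https://github.com/microsoft/unilm),
# not as a pip-installable package.
```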
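
The hyperparameters quoted in the Experiment Setup row can be collected in one place. The dictionaries below only transcribe the reported numbers; the layout and key names are ours, and the values the text leaves unstated are marked as assumptions:

```python
# Consolidated TrOCR hyperparameters, transcribed from the paper's
# experiment-setup text; dictionary layout and key names are ours.
TRAINING = {"batch_size": 2048, "learning_rate": 5e-5, "beam_size": 10}

# Encoders: all use 384x384 input resolution and 16x16 patches.
ENCODERS = {
    "DeiT_SMALL": {"layers": 12, "hidden": 384,  "heads": 6},
    "DeiT_BASE":  {"layers": 12, "hidden": 768,  "heads": 12},
    "BEiT_BASE":  {"layers": 12, "hidden": 768,  "heads": 12},
    "BEiT_LARGE": {"layers": 24, "hidden": 1024, "heads": 16},
}

# Decoders: RoBERTa-initialized decoders keep only the last half of the
# layers (last 6 of RoBERTa_BASE, last 12 of RoBERTa_LARGE).
DECODERS = {
    "small": {"layers": 6, "hidden": 256, "heads": 8},
    # The paper states only the 512 hidden size for the base decoder;
    # layers and heads here are assumed to match the small decoder.
    "base":  {"layers": 6, "hidden": 512, "heads": 8},
    "large": {"layers": 12, "hidden": 1024, "heads": 16},
}
```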