Nougat: Neural Optical Understanding for Academic Documents

Authors: Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition." From Section 5 (Results & Evaluation): "In this section we discuss the results and performance of the model. For an example see Fig. 4 or go to Sec. B. The model focuses only on the important content-relevant features of the page. The box around the equations is skipped." From Section 5.1 (Metrics): "We report the following metrics on our test set. Character Error Rate: the character error rate (CER), or normalized Levenshtein distance (Levenshtein, 1965), measures the number of character manipulations (insertions, deletions, substitutions) it takes to get from one string to another. BLEU: the BLEU metric (Papineni et al., 2002) was originally introduced for measuring the quality of text that has been machine-translated from one language to another; it computes a score based on the number of matching n-grams between the candidate and reference sentences. METEOR: another machine-translation metric, with a focus on recall instead of precision, introduced in (Banerjee & Lavie, 2005). F-measure: we also compute the F1-score and report the precision and recall." (A minimal sketch of the CER computation appears after the table.)
Researcher Affiliation | Industry | "Correspondence to: lblecher@meta.com"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the models and code to accelerate future work on scientific text recognition. We release the code and the model on GitHub: https://github.com/facebookresearch/nougat" (A hedged inference sketch appears after the table.)
Open Datasets | Yes | "To the best of our knowledge there is no paired dataset of PDF pages and corresponding source code out there, so we created our own from the open access articles on arXiv (https://arxiv.org/). For layout diversity we also include a subset of the PubMed Central (PMC) open access non-commercial dataset (https://www.ncbi.nlm.nih.gov/pmc/). During the pretraining, a portion of the Industry Documents Library (IDL, https://www.industrydocuments.ucsf.edu/) is included."
Dataset Splits | No | The paper mentions a 'test set' for its evaluation but does not explicitly describe a validation split for its own experiments on the created dataset; mentions of a 'validation split' refer to other papers' results or general benchmarks, not to the authors' own setup.
Hardware Specification | Yes | "On a machine with an NVIDIA A10G graphics card with 24GB VRAM we can process 6 pages in parallel."
Software Dependencies | No | The paper mentions specific software components such as the AdamW optimizer (Loshchilov & Hutter, 2019) and the Albumentations library (Buslaev et al., 2020), but it does not give version numbers for them, which a reproducible description of ancillary software would require.
Experiment Setup | Yes | "Training: We use an AdamW optimizer (Loshchilov & Hutter, 2019) to train for 3 epochs with an effective batch size of 192. Due to training instabilities, we choose a learning rate of lr_init = 5 × 10^-5, which is reduced by a factor of 0.9996 every 15 updates until it reaches lr_end = 7.5 × 10^-6. The visual encoder ... resizes the image to fit in a fixed rectangle of size (H, W). If the image is smaller than the rectangle, additional padding is added to ensure each image has the same dimensionality. We use a Swin Transformer (Liu et al., 2021) ... The Transformer decoder has a maximal sequence length of S = 4096. ... The BART decoder is a decoder-only transformer with 10 layers. The entire architecture has a total of 350M parameters. We also experiment with a smaller model (250M parameters) with a slightly smaller sequence length of S = 3584 and only 4 decoder layers, where we start from the pre-trained base model. During inference the text is generated using greedy decoding." (A sketch of this learning-rate schedule appears after the table.)
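
The CER definition quoted in the Research Type row is easy to reproduce. Below is a minimal sketch in plain Python; the quoted text does not spell out the normalization denominator, so normalizing by the reference length is an assumption here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate; normalization by reference length is assumed."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("Nugat", "Nougat"))  # one missing character -> 1/6 ≈ 0.17
```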
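For the released model (Open Source Code row), one way to run page-level inference is through the Hugging Face transformers port of the public checkpoint. This is a sketch only: the checkpoint name facebook/nougat-base, the NougatProcessor API, and the input file name are assumptions based on the post-publication open-source release, not details stated in the paper.

```python
# Sketch: assumes `pip install transformers torch pillow` (versions not pinned by the paper).
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")   # assumed checkpoint
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

page = Image.open("page.png").convert("RGB")            # one rasterized PDF page
pixel_values = processor(page, return_tensors="pt").pixel_values

# Greedy decoding (num_beams=1), matching the paper's stated inference setting.
outputs = model.generate(pixel_values, max_new_tokens=4096, num_beams=1)
markup = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(processor.post_process_generation(markup, fix_markdown=False))
```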
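The learning-rate rule quoted in the Experiment Setup row (start at 5 × 10^-5, multiply by 0.9996 every 15 updates, floor at 7.5 × 10^-6) maps directly to a step-indexed schedule. A sketch in PyTorch, using a stand-in module rather than the authors' 350M-parameter model:

```python
import torch
from torch.optim import AdamW

LR_INIT, LR_END, DECAY, EVERY = 5e-5, 7.5e-6, 0.9996, 15

model = torch.nn.Linear(10, 10)             # stand-in for the actual architecture
optimizer = AdamW(model.parameters(), lr=LR_INIT)

def lr_at(step: int) -> float:
    """lr_init * 0.9996**(step // 15), clamped from below at lr_end."""
    return max(LR_INIT * DECAY ** (step // EVERY), LR_END)

# LambdaLR multiplies the base lr by the lambda, so divide out LR_INIT.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at(step) / LR_INIT
)

for step in range(100):                     # training-loop skeleton
    optimizer.step()                        # (forward/loss/backward omitted)
    scheduler.step()
```

Under this rule the floor of 7.5 × 10^-6 is reached after roughly 71,000 optimizer updates (ln(0.15) / ln(0.9996) ≈ 4,742 decay intervals of 15 updates each).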