Automatic Assessment of OCR Quality in Historical Documents
Authors: Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Boris Capitanu, Loretta Auvil, Liz Grumbach, Richard Furuta, Laura Mandell
AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on a dataset containing over 72,000 manually labeled bounding boxes (BBs) from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions in 85% of the cases. |
| Researcher Affiliation | Academia | (1) Department of Computer Science and Engineering, Texas A&M University; (2) Initiative for Digital Humanities, Media, and Culture, Texas A&M University; (3) Illinois Informatics Institute, University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes the algorithms and their steps (e.g., rule-based classifier, iterative relabeling) but does not include any formal pseudocode blocks or sections labeled 'Algorithm'. |
| Open Source Code | No | The paper states, 'all tools used and produced by eMOP must remain free or open-source,' referring to a project policy. It also mentions 'The Tesseract open-source OCR engine', which is a third-party tool. However, it does not provide any specific link or statement confirming the release of the code developed for this paper's methodology. |
| Open Datasets | No | The paper states: 'To test the proposed algorithm we generated three separate datasets (see Table 2) consisting of binarized document images from the eMOP collection, carefully selected to represent the variety of documents in the corpora' and 'We evaluated this quality measure on a large dataset of 6,775 document images from the EEBO collection.' While datasets are mentioned, there are no specific links, DOIs, repositories, or formal citations for accessing these datasets, nor are they stated to be publicly available. |
| Dataset Splits | Yes | Dataset 1 was used to optimize thresholds in the pre-filtering stage, whereas dataset 2 was used to optimize the parameters (including P) of the local iterative relabeling stage. Dataset 3 was used to cross-validate the MLP and evaluate overall performance. The number of hidden units was optimized through three-fold cross-validation over dataset 3 with the F1-score as the objective function (a hedged sketch of this search follows the table). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper states: 'Our pipeline is based on the Tesseract open-source OCR engine available from Google (Smith 2007)'. While Tesseract is mentioned, a specific version number for this software dependency is not provided. |
| Experiment Setup | Yes | The MLP for the iterative process consisted of a hidden layer with 8 tangent-sigmoidal neurons and 2 output neurons (one per class) with a soft-max activation function, so that MLP outputs could be interpreted as probabilities. The maximum number of neighbors (the parameter in eq. 1) was set to 84 (21 per vertex), and the parameter in eq. (2) was set to 10 (a sketch of this architecture also follows the table). |
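
For concreteness, below is a minimal sketch of the classifier architecture described in the Experiment Setup row, written with scikit-learn. The feature vectors, labels, and dimensionality are placeholder assumptions; the paper releases no code, so this is illustrative rather than the authors' implementation. Note that scikit-learn's `MLPClassifier` uses a single logistic output for binary problems, which is mathematically equivalent to the two-unit soft-max the paper describes.

```python
# Illustrative sketch only: one hidden layer with 8 tanh ("tangent-sigmoidal")
# units and probabilistic outputs. The features and labels below are synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))      # placeholder bounding-box feature vectors
y = rng.integers(0, 2, size=1000)   # placeholder correct/incorrect BB labels

mlp = MLPClassifier(
    hidden_layer_sizes=(8,),        # 8 hidden neurons, as in the paper
    activation="tanh",              # tangent-sigmoidal hidden activation
    max_iter=2000,
    random_state=0,
)
mlp.fit(X, y)
print(mlp.predict_proba(X[:5]))     # outputs interpretable as probabilities
```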
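
The hidden-unit search described under Dataset Splits can likewise be sketched as a three-fold cross-validation with the F1-score as the objective. The candidate layer sizes and synthetic data are assumptions for illustration; the paper only states that the number of hidden units was tuned this way over dataset 3.

```python
# Hedged sketch of tuning the number of hidden units via three-fold
# cross-validation with F1 as the objective; candidate sizes are assumed.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))       # stand-in for dataset-3 BB features
y = rng.integers(0, 2, size=1000)    # stand-in for dataset-3 BB labels

search = GridSearchCV(
    MLPClassifier(activation="tanh", max_iter=2000, random_state=0),
    param_grid={"hidden_layer_sizes": [(4,), (8,), (16,), (32,)]},
    scoring="f1",                    # F1-score as the objective function
    cv=StratifiedKFold(n_splits=3),  # three folds, as reported in the paper
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```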