DocParser: Hierarchical Document Structure Parsing from Renderings
Authors: Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, Stefan Feuerriegel
AAAI 2021, pp. 4328-4338 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments confirm the effectiveness of our proposed weak supervision: Compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1 % and improves the F1 score of classifying hierarchical relations by 35.8 %. |
| Researcher Affiliation | Academia | Johannes Rausch,¹ Octavio Martinez,¹ Fabian Bissig,¹ Ce Zhang,¹ Stefan Feuerriegel² (¹ Department of Computer Science, ETH Zurich; ² Department of Management, Technology, and Economics, ETH Zurich) |
| Pseudocode | Yes | Details on our parameter choice and pseudocode are included in the supplements. |
| Open Source Code | Yes | Source codes and the arXivdocs dataset are available from https://github.com/DS3Lab/DocParser. |
| Open Datasets | Yes | We contribute the dataset arXivdocs that is tailored to the task of hierarchical structure parsing. It comes in two variants: arXivdocs-target and arXivdocs-weak. (1) arXivdocs-target contains documents that have been manually checked and annotated. (2) arXivdocs-weak contains a large-scale set of documents that have no manual annotations but that can be used for weak supervision. Source codes and the arXivdocs dataset are available from https://github.com/DS3Lab/DocParser. |
| Dataset Splits | Yes | arXivdocs-target comes with predefined splits for training, validation, and evaluation that consist of 160, 79, and 123 documents, respectively. |
| Hardware Specification | Yes | We train all models in a multi-GPU setting, using 8 GPUs with a VRAM of 12 GB. Each GPU was fed with one image per training iteration. Our system requires only 340 ms/document during entity detection (averaged over our validation set of 79 documents for DocParser WS+FT) on a single Titan Xp GPU with 12 GB VRAM and a batch size of 1. |
| Software Dependencies | No | Our implementation makes use of 23 categories C: CONTENT BLOCK, TABLE, TABLE ROW, TABLE COLUMN, TABLE CELL, TABULAR, FIGURE, HEADING, ABSTRACT, EQUATION, ITEMIZE, ITEM, BIBLIOGRAPHY BLOCK, TABLE CAPTION, FIGURE GRAPHIC, FIGURE CAPTION, HEADER, FOOTER, PAGE NUMBER, DATE, KEYWORDS, AUTHOR, AFFILIATION. Mask R-CNN ... is built upon the implementation of Mask R-CNN provided by Abdulla (2017). The title of the Abdulla (2017) reference indicates 'Keras and TensorFlow', but no specific version numbers are provided for these or any other software dependencies (a configuration sketch collecting these categories follows the table). |
| Experiment Setup | Yes | All neural models are initialized with pre-trained weights based on the MS COCO dataset (Lin et al. 2014). We then train each model for a total of 80,000 iterations, split into three phases of 20,000, 40,000, and 20,000 iterations, respectively. During the first phase, we freeze all layers of the CNN that is used as the initial block in Mask R-CNN. In the second phase, stages four and five of the CNN are unfrozen. In the last phase, all network layers are trainable. Early stopping is applied based on the performance on the validation set for unrefined predictions. The performance is measured every 2,000 iterations via the intersection over union with a threshold of 0.8. We train all models in a multi-GPU setting, using 8 GPUs with a VRAM of 12 GB. Each GPU was fed with one image per training iteration. Accordingly, the batch size per training iteration is set to 8. Furthermore, we use stochastic gradient descent with a learning rate of 0.001 and a learning momentum of 0.9. Parameter settings: During training, we randomly sampled 100 entities from the ground truth per document image (i.e., up to 100 entities, as some document images might have fewer). In Mask R-CNN, the maximum number of entity predictions per image is set to 200. During prediction, we only keep entities with a confidence score Pj of 0.7 or higher (see the hedged training-schedule sketch after this table). |
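
The Software Dependencies and Experiment Setup rows together specify the category set and the core hyperparameters. Since the paper builds on the Mask R-CNN implementation of Abdulla (2017), i.e., the Matterport Keras/TensorFlow code, these values can be sketched as a Matterport-style `Config` subclass. This is a minimal, hedged sketch: the class name `DocParserConfig`, the lowercase category identifiers, and the `STEPS_PER_EPOCH` value are illustrative assumptions, not taken from the paper.

```python
from mrcnn.config import Config  # Matterport Mask R-CNN (Abdulla 2017)

# The 23 entity categories quoted in the Software Dependencies row,
# rendered as lowercase identifiers (the naming is an assumption).
DOCPARSER_CATEGORIES = [
    "content_block", "table", "table_row", "table_column", "table_cell",
    "tabular", "figure", "heading", "abstract", "equation", "itemize",
    "item", "bibliography_block", "table_caption", "figure_graphic",
    "figure_caption", "header", "footer", "page_number", "date",
    "keywords", "author", "affiliation",
]

class DocParserConfig(Config):          # hypothetical class name
    NAME = "docparser"
    GPU_COUNT = 8                       # 8 GPUs, one image each...
    IMAGES_PER_GPU = 1                  # ...for an effective batch size of 8
    NUM_CLASSES = 1 + len(DOCPARSER_CATEGORIES)  # background + 23 categories
    LEARNING_RATE = 0.001               # SGD learning rate from the paper
    LEARNING_MOMENTUM = 0.9             # SGD momentum from the paper
    MAX_GT_INSTANCES = 100              # sample up to 100 GT entities per image
    DETECTION_MAX_INSTANCES = 200       # max entity predictions per image
    DETECTION_MIN_CONFIDENCE = 0.7      # keep predictions with score >= 0.7
    STEPS_PER_EPOCH = 1000              # assumed; not stated in the paper
```

All attribute names above are genuine fields of Matterport's `Config` class; only their values are transcribed from the quotes in the table.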
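
The three-phase schedule from the Experiment Setup row (20,000 / 40,000 / 20,000 iterations with progressive unfreezing) maps onto the `layers` argument of the same implementation's `train()` method. Again a hedged sketch under stated assumptions: `train_ds` and `val_ds` are placeholder `mrcnn.utils.Dataset` instances, the weight-file path is illustrative, and the epoch arithmetic relies on the assumed `STEPS_PER_EPOCH` of 1,000 from the sketch above.

```python
import mrcnn.model as modellib

config = DocParserConfig()  # from the configuration sketch above
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Start from MS COCO pre-trained weights (Lin et al. 2014), skipping the
# head layers whose shapes depend on NUM_CLASSES.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# train_ds / val_ds: mrcnn.utils.Dataset instances (construction omitted).
# With STEPS_PER_EPOCH = 1000, the 20k/40k/20k iteration phases become
# cumulative epoch targets of 20, 60, and 80 (Matterport's `epochs` argument
# is the total epoch index to train up to, not a per-call count).
model.train(train_ds, val_ds, learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads")  # phase 1: backbone CNN frozen
model.train(train_ds, val_ds, learning_rate=config.LEARNING_RATE,
            epochs=60, layers="4+")     # phase 2: unfreeze backbone stages 4-5
model.train(train_ds, val_ds, learning_rate=config.LEARNING_RATE,
            epochs=80, layers="all")    # phase 3: all layers trainable
```

Mapping phase 1 to `layers="heads"` and phase 2 to `layers="4+"` is an interpretation of "freeze all layers of the CNN" and "stages four and five ... are unfrozen"; the paper's supplements may define the phases differently.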