HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures

Authors: Jiefeng Ma, Jun Du, Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Huihui Zhu, Cong Liu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To better evaluate the system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with soft-mask operation, the DSPS model surpass the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
Researcher Affiliation | Collaboration | Jiefeng Ma1, Jun Du1*, Pengfei Hu1, Zhenrong Zhang1, Jianshu Zhang2, Huihui Zhu2, Cong Liu2; 1 NERC-SLIP, University of Science and Technology of China; 2 iFLYTEK Research
Pseudocode | No | The paper includes diagrams and descriptions of the proposed system but does not provide explicit pseudocode or algorithm blocks.
Open Source Code | Yes | All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
Open Datasets | Yes | All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
Dataset Splits | No | The paper mentions 'train and test set' in Figure 3's description: 'Figure 3 provides the statistics of semantic unit distribution over the train and test set of both HRDS and HRDH datasets.' However, it does not explicitly mention a 'validation set' or a split for validation.
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions software like 'PDFPlumber' and 'PyMuPDF' and machine learning models/frameworks like 'Sentence-Bert', 'ResNet-50', 'FPN', 'RoIAlign', 'Transformer', 'GRU', 'Layer normalization', 'Focal Loss', and 'Cascade RCNN', but it does not specify any version numbers for these software components or libraries.
Experiment Setup | No | The paper describes the overall architecture and components of the DSPS model (e.g., encoder, decoder, relation classifier) and the subtasks considered. However, it does not explicitly state specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings.