HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures
Authors: Jiefeng Ma, Jun Du, Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Huihui Zhu, Cong Liu
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To better evaluate the system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc. |
| Researcher Affiliation | Collaboration | Jiefeng Ma¹, Jun Du¹*, Pengfei Hu¹, Zhenrong Zhang¹, Jianshu Zhang², Huihui Zhu², Cong Liu²; ¹NERC-SLIP, University of Science and Technology of China; ²iFLYTEK Research |
| Pseudocode | No | The paper includes diagrams and descriptions of the proposed system but does not provide explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc. |
| Open Datasets | Yes | All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc. |
| Dataset Splits | No | The paper mentions 'train and test set' in Figure 3's description: 'Figure 3 provides the statistics of semantic unit distribution over the train and test set of both HRDS and HRDH datasets.' However, it does not explicitly mention a 'validation set' or a split for validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions software like 'PDFPlumber' and 'PyMuPDF' and machine learning models/frameworks like 'Sentence-Bert', 'ResNet-50', 'FPN', 'RoIAlign', 'Transformer', 'GRU', 'Layer normalization', 'Focal Loss', and 'Cascade RCNN', but it does not specify any version numbers for these software components or libraries. |
| Experiment Setup | No | The paper describes the overall architecture and components of the DSPS model (e.g., encoder, decoder, relation classifier) and the subtasks considered. However, it does not explicitly state specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings. |