Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models
Authors: Lei Wang, Jiabang He, Xing Xu, Ning Liu, Hui Liu
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. We evaluate our AETNet method on various downstream document image understanding tasks, including FUNSD (Jaume, Ekenel, and Thiran 2019) for form understanding, CORD (Park et al. 2019) for receipt understanding, DocVQA (Mathew, Karatzas, and Jawahar 2021) for document visual question answering, and a sampled subset RVL-CDIP-1 from RVL-CDIP (Harley, Ufkes, and Derpanis 2015) for document image classification. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, China 2 Singapore Management University, Singapore 3 Beijing Forestry University, China 4 Beijing Rongda Technology Co., Ltd., China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/MAEHCM/AET. |
| Open Datasets | Yes | We evaluate our AETNet method on various downstream document image understanding tasks, including FUNSD (Jaume, Ekenel, and Thiran 2019) for form understanding, CORD (Park et al. 2019) for receipt understanding, DocVQA (Mathew, Karatzas, and Jawahar 2021) for document visual question answering, and a sampled subset RVL-CDIP-1 from RVL-CDIP (Harley, Ufkes, and Derpanis 2015) for document image classification. |
| Dataset Splits | Yes | CORD is a receipt key information extraction dataset, including 1,000 receipts and 30 semantic labels defined under 4 categories, where 800 samples are used for training, 100 for validation, and 100 for testing. We follow the official partition of the DocVQA (Mathew, Karatzas, and Jawahar 2021) dataset, which consists of 10,194/1,286/1,287 images with 39,463/5,349/5,188 questions for training/validation/test, respectively. RVL-CDIP-1 is divided into 8,000 training samples, 1,000 validation samples, and 1,000 test samples. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. Only general statements such as "Due to the limitations of our servers" are present, without specific model numbers for GPUs, CPUs, or memory details. |
| Software Dependencies | No | The paper mentions specific software components like RoBERTa, DeiT, and Tesseract, but does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | No | The paper states: "The detailed description of hyper-parameters, including running epochs, learning rate, batch size, and optimizer, for our method on three downstream tasks and four datasets, are referred to https://github.com/MAEHCM/AET." This means the details are not explicitly provided within the main text of the paper. |