Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Authors: Lei Wang, Jiabang He, Xing Xu, Ning Liu, Hui Liu

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. We evaluate our AETNet method on various downstream document image understanding tasks, including FUNSD (Jaume, Ekenel, and Thiran 2019) for form understanding, CORD (Park et al. 2019) for receipt understanding, DocVQA (Mathew, Karatzas, and Jawahar 2021) for document visual question answering, and a sampled subset RVL-CDIP-1 from RVL-CDIP (Harley, Ufkes, and Derpanis 2015) for document image classification.
Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, China; 2 Singapore Management University, Singapore; 3 Beijing Forestry University, China; 4 Beijing Rongda Technology Co., Ltd., China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/MAEHCM/AET.
Open Datasets | Yes | We evaluate our AETNet method on various downstream document image understanding tasks, including FUNSD (Jaume, Ekenel, and Thiran 2019) for form understanding, CORD (Park et al. 2019) for receipt understanding, DocVQA (Mathew, Karatzas, and Jawahar 2021) for document visual question answering, and a sampled subset RVL-CDIP-1 from RVL-CDIP (Harley, Ufkes, and Derpanis 2015) for document image classification.
Dataset Splits | Yes | CORD is a receipt key information extraction dataset, including 1,000 receipts and 30 semantic labels defined under 4 categories, where 800 samples are used for training, 100 for validation, and 100 for testing. We follow the official partition of the DocVQA (Mathew, Karatzas, and Jawahar 2021) dataset, which consists of 10,194/1,286/1,287 images with 39,463/5,349/5,188 questions for training/validation/test, respectively. RVL-CDIP-1 is divided into 8,000 training samples, 1,000 validation samples, and 1,000 test samples (one way such a subset could be reconstructed is sketched below the table).
Hardware Specification | No | The paper does not describe the hardware used to run its experiments; it offers only general remarks such as "Due to the limitations of our servers", with no model numbers for GPUs or CPUs and no memory details.
Software Dependencies | No | The paper mentions specific software components such as RoBERTa, DeiT, and Tesseract, but does not provide version numbers for these or any other software dependencies needed for reproducibility.
Experiment Setup | No | The paper states: "The detailed description of hyper-parameters, including running epochs, learning rate, batch size, and optimizer, for our method on three downstream tasks and four datasets, are referred to https://github.com/MAEHCM/AET." The hyper-parameters are therefore deferred to the repository rather than reported in the paper itself.
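
The split sizes quoted in the Dataset Splits row can be checked programmatically, but note that the paper does not document how RVL-CDIP-1 was sampled from RVL-CDIP. The sketch below is illustrative only: it assumes the `naver-clova-ix/cord-v2` and `aharley/rvl_cdip` mirrors on the Hugging Face Hub, uniform random sampling, and an arbitrary seed, none of which are confirmed by the paper.

```python
# Minimal sketch: verify the CORD partition and draw an RVL-CDIP-1-style
# subset. Hub dataset IDs, uniform sampling, and seed 42 are assumptions,
# not the authors' documented protocol.
import random

from datasets import load_dataset

# CORD ships with the 800/100/100 partition quoted in the table.
cord = load_dataset("naver-clova-ix/cord-v2")
print({name: len(split) for name, split in cord.items()})
# expected: {'train': 800, 'validation': 100, 'test': 100}

# RVL-CDIP-1 is a sampled subset of the full corpus (~400k images).
rvl_cdip = load_dataset("aharley/rvl_cdip")
rng = random.Random(42)  # arbitrary fixed seed for repeatability

def sample_subset(split, size):
    """Select `size` examples from `split` uniformly at random."""
    return split.select(rng.sample(range(len(split)), size))

rvl_cdip_1 = {
    "train": sample_subset(rvl_cdip["train"], 8_000),
    "validation": sample_subset(rvl_cdip["validation"], 1_000),
    "test": sample_subset(rvl_cdip["test"], 1_000),
}
print({name: len(split) for name, split in rvl_cdip_1.items()})
# expected: {'train': 8000, 'validation': 1000, 'test': 1000}
```

Drawing the validation and test subsets from the official validation and test splits keeps the sampled evaluation data disjoint from training data; whether the authors sampled uniformly or per class is not stated in the paper.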