Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Authors: Lei Wang, Jiabang He, Xing Xu, Ning Liu, Hui Liu

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. We evaluate our AETNet method on various downstream document image understanding tasks, including FUNSD (Jaume, Ekenel, and Thiran 2019) for form understanding, CORD (Park et al. 2019) for receipt understanding, DocVQA (Mathew, Karatzas, and Jawahar 2021) for document visual question answering, and a sampled subset RVL-CDIP-1 from RVL-CDIP (Harley, Ufkes, and Derpanis 2015) for document image classification.
Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, China; 2 Singapore Management University, Singapore; 3 Beijing Forestry University, China; 4 Beijing Rongda Technology Co., Ltd., China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/MAEHCM/AET.
Open Datasets | Yes | We evaluate our AETNet method on various downstream document image understanding tasks, including FUNSD (Jaume, Ekenel, and Thiran 2019) for form understanding, CORD (Park et al. 2019) for receipt understanding, DocVQA (Mathew, Karatzas, and Jawahar 2021) for document visual question answering, and a sampled subset RVL-CDIP-1 from RVL-CDIP (Harley, Ufkes, and Derpanis 2015) for document image classification.
Dataset Splits | Yes | CORD is a receipt key information extraction dataset, including 1,000 receipts and 30 semantic labels defined under 4 categories, where 800 samples are used for training, 100 for validation, and 100 for testing. We follow the official partition of the DocVQA (Mathew, Karatzas, and Jawahar 2021) dataset, which consists of 10,194/1,286/1,287 images with 39,463/5,349/5,188 questions for training/validation/test, respectively. RVL-CDIP-1 is divided into 8,000 training samples, 1,000 validation samples, and 1,000 test samples (one way such a subset could be reconstructed is sketched below the table).
Hardware Specification | No | The paper does not describe the hardware used to run its experiments; it offers only general remarks such as "Due to the limitations of our servers", with no model numbers for GPUs or CPUs and no memory details.
Software Dependencies | No | The paper mentions specific software components such as RoBERTa, DeiT, and Tesseract, but does not provide version numbers for these or any other software dependencies needed for reproducibility.
Experiment Setup | No | The paper states: "The detailed description of hyper-parameters, including running epochs, learning rate, batch size, and optimizer, for our method on three downstream tasks and four datasets, are referred to https://github.com/MAEHCM/AET." The hyper-parameters are therefore deferred to the repository rather than reported in the paper itself.
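
The split sizes quoted in the Dataset Splits row can be checked programmatically, but note that the paper does not document how RVL-CDIP-1 was sampled from RVL-CDIP. The sketch below is illustrative only: it assumes the `naver-clova-ix/cord-v2` and `aharley/rvl_cdip` mirrors on the Hugging Face Hub, uniform random sampling, and an arbitrary seed, none of which are confirmed by the paper.

```python
# Minimal sketch: verify the CORD partition and draw an RVL-CDIP-1-style
# subset. Hub dataset IDs, uniform sampling, and seed 42 are assumptions,
# not the authors' documented protocol.
import random

from datasets import load_dataset

# CORD ships with the 800/100/100 partition quoted in the table.
cord = load_dataset("naver-clova-ix/cord-v2")
print({name: len(split) for name, split in cord.items()})
# expected: {'train': 800, 'validation': 100, 'test': 100}

# RVL-CDIP-1 is a sampled subset of the full corpus (~400k images).
rvl_cdip = load_dataset("aharley/rvl_cdip")
rng = random.Random(42)  # arbitrary fixed seed for repeatability

def sample_subset(split, size):
    """Select `size` examples from `split` uniformly at random."""
    return split.select(rng.sample(range(len(split)), size))

rvl_cdip_1 = {
    "train": sample_subset(rvl_cdip["train"], 8_000),
    "validation": sample_subset(rvl_cdip["validation"], 1_000),
    "test": sample_subset(rvl_cdip["test"], 1_000),
}
print({name: len(split) for name, split in rvl_cdip_1.items()})
# expected: {'train': 8000, 'validation': 1000, 'test': 1000}
```

Drawing the validation and test subsets from the official validation and test splits keeps the sampled evaluation data disjoint from training data; whether the authors sampled uniformly or per class is not stated in the paper.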