DocFormerv2: Local Features for Document Understanding

Authors: Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DocFormerv2, when evaluated on nine challenging datasets, shows state-of-the-art performance over strong baselines on all of them: TabFact (+4.3%), InfoVQA (+1.4%), FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI, and Flamingo) on these tasks. Extensive ablations show that, due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU. Experimentally, we demonstrate that DocFormerv2 achieves state-of-the-art performance on five VDU tasks.
Researcher Affiliation | Collaboration | Srikar Appalaraju¹*, Peng Tang¹, Qi Dong¹, Nishant Sankaran¹, Yichu Zhou², R. Manmatha¹ (¹AWS AI Labs, ²School of Computing, University of Utah)
Pseudocode | No | The paper describes the architecture and tasks with text and diagrams (Figures 3, 4, and 5), but it does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not explicitly state that source code for the methodology is released, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Following prior art (Appalaraju et al. 2021; Powalski et al. 2021; Biten et al. 2022; Xu et al. 2020a, 2021; Huang et al. 2022), we use the Industrial Document Library (IDL) dataset (https://www.industrydocuments.ucsf.edu/) for pre-training. The IDL is a collection of industry documents hosted by UCSF; it hosts millions of documents publicly disclosed from various industries such as tobacco, drug, and food.
Dataset Splits | Yes | Following common practice (Łukasz Borchmann et al. 2021; Powalski et al. 2021; Xu et al. 2020b), we train DocFormerv2 on the combination of the training and validation sets and do evaluation on the test set for each dataset. ... For OCR-VQA, we fine-tune our models on the training set and do evaluation on the validation and test sets. For TextVQA and ST-VQA, following the previous state-of-the-art methods (Biten et al. 2022; Yang et al. 2021), we fine-tune our models on the combination of the TextVQA and ST-VQA training sets and do evaluation on the validation and test sets of each dataset. (A minimal sketch of this split protocol is given after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory capacity used to run the experiments.
Software Dependencies | No | The paper mentions using 'Pytorch (Paszke et al. 2019) and the Huggingface library (Thomas et al. 2019)' but does not provide version numbers for these software dependencies.
Experiment Setup | No | The paper mentions general training aspects such as a 'maximum sequence limit s' and that 'k, l, m are empirically determined' for the loss coefficients, but it does not provide specific hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings in the main text. (A sketch of such a weighted loss appears below.)
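To make the Dataset Splits row concrete, here is a minimal, hedged PyTorch sketch of the protocol it quotes: fine-tune on the concatenation of the training and validation sets, then evaluate only on the held-out test set. All tensors below are random placeholders, and the loaders merely stand in for whatever fine-tuning and evaluation loops DocFormerv2 actually uses.

```python
# Minimal sketch of the "train + validation for fine-tuning, test for evaluation"
# split protocol described above. All data here is random placeholder data.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

train = TensorDataset(torch.randn(80, 4), torch.randint(0, 2, (80,)))
val = TensorDataset(torch.randn(20, 4), torch.randint(0, 2, (20,)))
test = TensorDataset(torch.randn(25, 4), torch.randint(0, 2, (25,)))

# Fine-tune on the concatenation of the training and validation splits ...
finetune_loader = DataLoader(ConcatDataset([train, val]), batch_size=8, shuffle=True)
# ... and report results only on the untouched test split.
eval_loader = DataLoader(test, batch_size=8)
```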
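The Experiment Setup row quotes loss coefficients k, l, m without giving their values, which suggests a total pre-training objective of the form L = k·L₁ + l·L₂ + m·L₃ over the individual task losses. The helper below is an assumption-laden sketch of such a weighted sum; the default coefficients of 1.0 are placeholders, not the authors' empirically determined settings.

```python
# Hedged sketch of a weighted multi-task loss; k, l, m are hyperparameters
# whose actual values are not reported in the paper.
import torch

def combined_loss(loss_a: torch.Tensor,
                  loss_b: torch.Tensor,
                  loss_c: torch.Tensor,
                  k: float = 1.0,
                  l: float = 1.0,
                  m: float = 1.0) -> torch.Tensor:
    """Return the weighted sum k * loss_a + l * loss_b + m * loss_c."""
    return k * loss_a + l * loss_b + m * loss_c
```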