DocFormerv2: Local Features for Document Understanding
Authors: Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DocFormerv2, when evaluated on nine challenging datasets, shows state-of-the-art performance over strong baselines on all of them: TabFact (+4.3%), InfoVQA (+1.4%), FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI and Flamingo) on these tasks. Extensive ablations show that due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU. Experimentally, we demonstrate that DocFormerv2 achieves state-of-the-art performance on five VDU tasks. |
| Researcher Affiliation | Collaboration | Srikar Appalaraju1*, Peng Tang1, Qi Dong1, Nishant Sankaran1, Yichu Zhou2, R. Manmatha1 (1AWS AI Labs, 2School of Computing at University of Utah) |
| Pseudocode | No | The paper describes the architecture and tasks with text and diagrams (Figure 3, 4, 5), but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology is released or provide a direct link to a code repository. |
| Open Datasets | Yes | Following prior-art (Appalaraju et al. 2021; Powalski et al. 2021; Biten et al. 2022; Xu et al. 2020a, 2021; Huang et al. 2022) we use the Industrial Document Library (IDL) dataset for pre-training. The IDL is a collection of industry documents hosted by UCSF. It hosts millions of documents publicly disclosed from various industries like tobacco, drug, food, etc. (https://www.industrydocuments.ucsf.edu/) |
| Dataset Splits | Yes | Following common practice (Łukasz Borchmann et al. 2021; Powalski et al. 2021; Xu et al. 2020b), we train DocFormerv2 on the combination of the training and validation sets and do evaluation on the test set for each dataset. ... For OCR-VQA, we fine-tune our models on the training set and do evaluation on the validation and test sets. For TextVQA and ST-VQA, following the previous state-of-the-art methods (Biten et al. 2022; Yang et al. 2021), we fine-tune our models on the combination of the TextVQA and ST-VQA training sets and do evaluation on the validation and test sets of each dataset. (A hedged sketch of this split protocol appears after the table.) |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Pytorch (Paszke et al. 2019) and the Huggingface library (Thomas et al. 2019)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | No | The paper mentions general training aspects such as a 'maximum sequence limit s' and that 'k, l, m are empirically determined' for the loss coefficients. However, it does not provide specific hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings in the main text. (A hedged sketch of such a weighted loss follows the table.) |
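
The split protocol quoted under Dataset Splits can be illustrated with a minimal sketch. The use of the HuggingFace `datasets` library and the placeholder dataset identifiers are assumptions, since the paper does not release code; this only mirrors the described recipe of fine-tuning on train plus validation (or combined training sets for TextVQA/ST-VQA) and evaluating on the held-out splits.

```python
# Minimal sketch of the split protocol described in the paper.
# Dataset identifiers are placeholders, not from the paper.
from datasets import load_dataset, concatenate_datasets


def build_finetune_splits(dataset_id: str):
    """Fine-tune on train + validation, evaluate on the test split."""
    ds = load_dataset(dataset_id)  # expects "train", "validation", "test" splits
    finetune_set = concatenate_datasets([ds["train"], ds["validation"]])
    return finetune_set, ds["test"]


def build_scene_text_finetune_set(textvqa_id: str, stvqa_id: str):
    """For TextVQA / ST-VQA, the paper combines both training sets before fine-tuning."""
    textvqa = load_dataset(textvqa_id)
    stvqa = load_dataset(stvqa_id)
    return concatenate_datasets([textvqa["train"], stvqa["train"]])
```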
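
The Experiment Setup row notes that coefficients k, l, m weight the losses and are empirically determined but not reported. The sketch below shows what such a weighted combination of three loss terms could look like in PyTorch; the coefficient values and the grouping into exactly three terms are assumptions for illustration only.

```python
import torch

# Hypothetical coefficient values; the paper only states that k, l, m
# are empirically determined and does not report them.
k, l, m = 1.0, 1.0, 1.0


def total_pretraining_loss(loss_a: torch.Tensor,
                           loss_b: torch.Tensor,
                           loss_c: torch.Tensor) -> torch.Tensor:
    # Weighted sum of three pre-training losses, mirroring the
    # k, l, m coefficients mentioned in the paper.
    return k * loss_a + l * loss_b + m * loss_c
```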