UniDoc: Unified Pretraining Framework for Document Understanding

Authors: Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Nikolaos Barmpalios, Ani Nenkova, Tong Sun

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks. Extensive experiments and analysis provide useful insights on the effectiveness of the pretraining tasks and show outstanding performance on various downstream tasks.
Researcher Affiliation | Industry | Jiuxiang Gu (1), Jason Kuen (1), Vlad I. Morariu (1), Handong Zhao (1), Nikolaos Barmpalios (2), Rajiv Jain (1), Ani Nenkova (1), Tong Sun (1); (1) Adobe Research, (2) Adobe Document Cloud; {jigu,kuen,morariu,hazhao,barmpali,rajijain,nenkova,tsun}@adobe.com
Pseudocode | No | The paper does not contain a structured pseudocode or algorithm block. Figure 1 is a diagram, not pseudocode.
Open Source Code | No | The paper does not provide an explicit statement or a direct link to the source code for the UDoc method developed by the authors. It only provides links to third-party tools such as EasyOCR and Detectron2, which were used in the work.
Open Datasets | Yes | We build our pretraining corpus based on IIT-CDIP Test Collection 1.0 [26], which contains more than 11M scanned document images. We use FUNSD [27] as the evaluation dataset. The performance on this task is evaluated on the CORD [7] dataset. We use RVL-CDIP [8] as the target dataset. We evaluate the effectiveness of our pretrained visual backbone on PubLayNet [28].
Dataset Splits | Yes | Form Understanding: it contains 149/50 training/testing images. Receipt Understanding: it contains 626/247 receipts for training/testing. Document Classification: it consists of 320K/40K/40K training/validation/testing images under 16 categories.
Hardware Specification | Yes | The pretraining is conducted on 8 NVIDIA Tesla V100 32GB GPUs with a batch size of 64.
Software Dependencies | No | The paper mentions several software components and tools that were used (EasyOCR, Detectron2, BERT-NLI-STSb-base, the Adam optimizer, Tesseract, Google OCR) but does not provide specific version numbers for them, which reproducibility requires.
Experiment Setup | Yes | The pretraining is conducted on 8 NVIDIA Tesla V100 32GB GPUs with a batch size of 64. It is trained with the Adam optimizer [32], with an initial learning rate of 10^-5, weight decay of 10^-4, and learning rate warmup in the first 20% of iterations. For MSM, we set the mask probability p^s_mask for input sentences to 15%. For VCL, λ is set to 0.1, κ is set to 0.1, the mask probability p^v_mask is set to 7.5%, and the masked RoI features are filled with zeros. The temperature τ is annealed from 2.0 to 0.5 by a factor of 0.999995 at every iteration.
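
The hyperparameters quoted in the Experiment Setup row map fairly directly onto a training-loop skeleton. The sketch below is an illustrative PyTorch reconstruction under stated assumptions: the model is a placeholder, the total iteration count is assumed (it is not given in the quoted text), the warmup is assumed to be linear, and helper names such as mask_rois are hypothetical. It is not the authors' code, which the paper does not release.

    import torch

    # Values taken from the paper's reported setup; names and structure here are
    # illustrative assumptions, not the authors' implementation.
    BATCH_SIZE   = 64       # global batch size (8x NVIDIA Tesla V100 32GB)
    LR           = 1e-5     # initial learning rate
    WEIGHT_DECAY = 1e-4
    P_SENT_MASK  = 0.15     # p^s_mask for Masked Sentence Modeling (MSM)
    P_ROI_MASK   = 0.075    # p^v_mask for Visual Contrastive Learning (VCL)
    VCL_LAMBDA   = 0.1      # lambda reported for VCL
    VCL_KAPPA    = 0.1      # kappa reported for VCL
    TAU_START, TAU_MIN, TAU_DECAY = 2.0, 0.5, 0.999995

    model = torch.nn.Linear(768, 768)       # stand-in for the UDoc encoder
    total_iters = 100_000                   # assumption: not stated in this section
    warmup_iters = int(0.2 * total_iters)   # warmup over the first 20% of iterations

    optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lambda it: min(1.0, (it + 1) / warmup_iters),  # assumed linear warmup, then constant
    )

    def mask_rois(roi_feats: torch.Tensor) -> torch.Tensor:
        """Zero out each RoI feature with probability p^v_mask, as described for VCL."""
        keep = torch.rand(roi_feats.size(0), device=roi_feats.device) >= P_ROI_MASK
        return roi_feats * keep.unsqueeze(-1).float()

    tau = TAU_START
    for it in range(total_iters):
        # ... sample a batch, mask 15% of input sentences for MSM, apply mask_rois
        #     to the region features, compute the joint pretraining loss, backprop ...
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        tau = max(TAU_MIN, tau * TAU_DECAY)  # anneal temperature each iteration, floor at 0.5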