ADOPD: A Large-Scale Document Page Decomposition Dataset

Authors: Jiuxiang Gu, Xiangxi Shi, Jason Kuen, Lu Qi, Ruiyi Zhang, Anqi Liu, Ani Nenkova, Tong Sun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experimental analyses to validate our data and assess the four tasks using various models.
Researcher Affiliation | Collaboration | 1 Adobe Research, 2 Oregon State University, 3 UC Merced, 4 Johns Hopkins University
Pseudocode | Yes | Alg. 1 outlines the process where we integrate outlier detection for data collection and taxonomy discovery.
Open Source Code | No | The paper links to a project page (https://adopd2024.github.io) but provides neither a direct link to a source code repository nor an explicit statement that the source code for the methodology will be released.
Open Datasets | Yes | The images in ADOPD are sourced from Laion-HR (Laion High Resolution), which comprises high-resolution web images, including multilingual document images. Laion High Resolution. Laion. https://huggingface.co/datasets/laion/laion-high-resolution. 2023.
Dataset Splits | Yes | We experiment on a subset of ADOPD, with training and validation sets comprising 50k and 10k images, respectively.
Hardware Specification | Yes | All experiments are run on NVIDIA A100-80GB GPUs.
Software Dependencies | No | The paper mentions software such as Detectron2, MMDetection, and Hugging Face Transformers, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Following standard practices (Ghiasi et al., 2021), we employ an input resolution of 1024 × 1024, achieved by re-scaling and padding the shorter side of the image. Doc2Mask (CropFormer and Mask2Former) and Doc2Box (Faster R-CNN, Cascade Mask R-CNN) are trained for 15 epochs with a batch size of 32 on 8 GPUs to achieve full convergence. We train Deformable-DETR for 30 epochs due to slow convergence issues. For Doc2Seq, we train it for 50 epochs on 8 GPUs with a total batch size of 800. Finetuning CLIP ViT-G/14 on Doc2Seq data takes 100 epochs on 8x8 GPUs.
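The Open Datasets row points to the Laion-HR index hosted on Hugging Face. As a minimal sketch, assuming the Hugging Face `datasets` library and the usual LAION column names (URL, WIDTH, HEIGHT), the index can be streamed without downloading it in full; note that LAION releases are URL/metadata indexes, so the document images themselves must be fetched separately.

```python
from itertools import islice

from datasets import load_dataset

# Stream the Laion-HR index referenced in the Open Datasets row.
# Each record describes an image (source URL plus metadata) rather than
# containing pixels; column names follow the usual LAION schema and should
# be verified against the dataset card.
laion_hr = load_dataset(
    "laion/laion-high-resolution", split="train", streaming=True
)

for row in islice(laion_hr, 3):
    print(row["URL"], row["WIDTH"], row["HEIGHT"])
```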
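The Experiment Setup row describes a 1024 × 1024 input resolution obtained by rescaling the image and padding its shorter side. Below is a minimal sketch of that kind of preprocessing, assuming Pillow; the function name, zero-padding value, and top-left placement are illustrative assumptions, not taken from the paper's (unreleased) code.

```python
from PIL import Image


def resize_and_pad(image: Image.Image, target: int = 1024) -> Image.Image:
    """Rescale so the longer side equals `target`, then zero-pad the shorter
    side to produce a square target x target canvas.

    Illustrative only: the padding value and placement are assumptions.
    """
    w, h = image.size
    scale = target / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = image.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new("RGB", (target, target), (0, 0, 0))
    canvas.paste(resized, (0, 0))  # top-left placement; centering is equally common
    return canvas


if __name__ == "__main__":
    doc = Image.open("page.png").convert("RGB")  # hypothetical input file
    model_input = resize_and_pad(doc)
    print(model_input.size)  # (1024, 1024)
```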