StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

Authors: Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
Researcher Affiliation | Industry | Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang; Department of Computer Vision Technology (VIS), Baidu Inc.
Pseudocode | No | The paper describes the architecture and processes in text and with diagrams, but it does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/PaddlePaddle/VIMER/tree/main/StrucTexT
Open Datasets | Yes | By following DiT Li et al. (2022), we pre-train StrucTexTv2 on the IIT-CDIP Test Collection 1.0 dataset Lewis et al. (2006), whose 11 million multi-page documents are split into single pages, totaling 42 million document images. RVL-CDIP Harley et al. (2015) contains 400,000 grayscale document images in 16 classes... PubLayNet Zhong et al. (2019) consists of more than 360,000 paper images... WTW Long et al. (2021) covers unconstrained tables in natural scenes... FUNSD Jaume et al. (2019) is a form understanding dataset that contains 199 forms...
Dataset Splits | Yes | We evaluate on the validation set of PubLayNet for document layout analysis. We fine-tune the model on RVL-CDIP for 20 epochs with cross-entropy loss.
Hardware Specification | Yes | The whole pre-training phase takes nearly a week with 32 Nvidia Tesla A100 80GB GPUs.
Software Dependencies | No | The paper references PaddlePaddle via the linked GitHub repository, but it does not specify version numbers for any software dependencies such as Python, PaddlePaddle, or CUDA.
Experiment Setup | Yes | The learning rate is set to 3e-4 and the batch size is 28. (RVL-CDIP) The learning rate is set to 1e-2, decaying to 1e-3 at epoch 3 and to 1e-4 at epoch 6. (PubLayNet) We fine-tune our model end-to-end using the ADAM optimizer Kingma & Ba (2015) for 20 epochs with a batch size of 16 and a learning rate of 1e-4. (WTW) We fine-tune the whole model for 1200 epochs with a batch size of 32, following a cosine learning rate policy with an initial learning rate of 5e-4. (FUNSD)
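
The Experiment Setup row quotes several per-dataset fine-tuning schedules. The minimal PyTorch-style sketch below shows how those quoted hyperparameters could be wired up (step decay at epochs 3 and 6 for PubLayNet, ADAM at 1e-4 for WTW, cosine decay from 5e-4 over 1200 epochs for FUNSD). The official StrucTexTv2 release is PaddlePaddle-based, so this is only an illustration under stated assumptions: the dummy model, the SGD choice for PubLayNet, and the Adam choice for FUNSD are not taken from the paper.

import torch

# Stand-in module; the real model is the StrucTexTv2 backbone plus a task head.
model = torch.nn.Linear(768, 16)

# PubLayNet layout analysis: lr 1e-2, decayed to 1e-3 at epoch 3 and 1e-4 at epoch 6.
# The optimizer type is not stated in the quote; SGD here is an assumption.
publaynet_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
publaynet_sched = torch.optim.lr_scheduler.MultiStepLR(
    publaynet_opt, milestones=[3, 6], gamma=0.1)

# WTW table structure recognition: ADAM optimizer, 20 epochs, batch size 16, lr 1e-4.
wtw_opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# FUNSD information extraction: 1200 epochs, batch size 32, cosine decay from 5e-4.
# Adam is assumed; the quote only specifies the cosine policy and the initial lr.
funsd_opt = torch.optim.Adam(model.parameters(), lr=5e-4)
funsd_sched = torch.optim.lr_scheduler.CosineAnnealingLR(funsd_opt, T_max=1200)

# Epoch-level schedulers are stepped once after each training epoch, e.g.:
for _ in range(6):
    # ... run one training epoch on the PubLayNet loader here ...
    publaynet_sched.step()

Stepping MultiStepLR past its last milestone simply keeps the final rate (1e-4 here), so the schedule matches the quoted decay points regardless of the total epoch count.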