StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training
Authors: Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario. |
| Researcher Affiliation | Industry | Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang; Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper describes the architecture and processes in text and with diagrams, but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/PaddlePaddle/VIMER/tree/main/StrucTexT |
| Open Datasets | Yes | By following DiT Li et al. (2022), we pretrain StrucTexTv2 on the IIT-CDIP Test Collection 1.0 dataset Lewis et al. (2006), whose 11 million multi-page documents are split into single pages, totaling 42 million document images. RVL-CDIP Harley et al. (2015) contains 400,000 grayscale document images in 16 classes... PubLayNet Zhong et al. (2019) consists of more than 360,000 paper images... WTW Long et al. (2021) covers unconstrained tables in natural scenes... FUNSD Jaume et al. (2019) is a form understanding dataset that contains 199 forms... |
| Dataset Splits | Yes | We evaluate on the validation set of PubLayNet for document layout analysis. We fine-tune the model on RVL-CDIP for 20 epochs with cross-entropy loss. |
| Hardware Specification | Yes | The whole pre-training phase takes nearly a week with 32 Nvidia Tesla A100 80G GPUs. |
| Software Dependencies | No | The paper mentions implementing the work and references PaddlePaddle in the GitHub link, but it does not specify version numbers for any software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The learning rate is set to 3e-4 and the batch size is 28. (RVL-CDIP) The learning rate is set to 1e-2, decaying to 1e-3 at epoch 3 and to 1e-4 at epoch 6. (PubLayNet) We fine-tune our model end-to-end using the ADAM Kingma & Ba (2015) optimizer for 20 epochs with a batch size of 16 and a learning rate of 1e-4. (WTW) We fine-tune the whole model for 1200 epochs with a batch size of 32. We follow a cosine learning rate policy and set the initial learning rate to 5e-4. (FUNSD) A hedged sketch of these schedules follows the table. |
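The quoted hyperparameters correspond to standard learning-rate schedules. Below is a minimal sketch, not the authors' code, of how the PubLayNet step decay and the FUNSD cosine policy could be configured in PaddlePaddle (assumed only because the released repo, PaddlePaddle/VIMER, targets that framework); the `model` argument, step counts, and the PubLayNet optimizer family are assumptions not stated in the paper.

```python
# Minimal sketch of the quoted fine-tuning schedules; NOT the authors' code.
import paddle


def publaynet_optimizer(model):
    """PubLayNet: LR 1e-2, decayed to 1e-3 at epoch 3 and 1e-4 at epoch 6."""
    # PiecewiseDecay advances once per lr.step() call, so call it per epoch.
    lr = paddle.optimizer.lr.PiecewiseDecay(
        boundaries=[3, 6], values=[1e-2, 1e-3, 1e-4])
    # The optimizer family for PubLayNet is not quoted; Momentum is an assumption.
    return paddle.optimizer.Momentum(
        learning_rate=lr, parameters=model.parameters())


def funsd_optimizer(model, total_steps):
    """FUNSD: cosine learning rate policy from an initial LR of 5e-4."""
    lr = paddle.optimizer.lr.CosineAnnealingDecay(
        learning_rate=5e-4, T_max=total_steps)
    # ADAM is quoted for WTW; reusing it for FUNSD is an assumption.
    return paddle.optimizer.Adam(
        learning_rate=lr, parameters=model.parameters())
```

In a typical training loop, `optimizer.step()` would be called per batch and the scheduler's `lr.step()` per epoch (piecewise) or per step (cosine, with `total_steps` sized accordingly).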