Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

Authors: Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.
Researcher Affiliation | Collaboration | Jaeyoo Park (ECE, Seoul National University), Jin Young Choi (IPAI, Seoul National University), Jeonghyung Park (Samsung SDS), Bohyung Han (ECE & IPAI, Seoul National University). Contact: {bellos1203, jychoi999, bhhan}@snu.ac.kr, jeong.h.park@samsung.com
Pseudocode | Yes | Figure 2: Illustration of the Hierarchical Visual Feature Aggregation (HVFA) module.
Open Source Code | No | The paper mentions using specific external libraries and models but does not state that the code for its own method is released as open source, nor does it provide a link to it.
Open Datasets | Yes | We utilize document instruction dataset corpora for training, following [19]. The training data covers various types of document images, including tables, charts, natural images, and web page screenshots. The datasets used are DocVQA [45], InfographicsVQA [46], DeepForm [47], Kleister Charity [48], WikiTableQuestions [49], TabFact [50], ChartQA [51], VisualMRC [52], TextVQA [53], and TextCaps [54]. The combined dataset comprises approximately 650K image-instruction pairs.
Dataset Splits | Yes | For all of the benchmarks, we used the train/val/test splits provided by UReader [19], which are built on the DUE benchmark [59].
Hardware Specification | Yes | The total batch size was set to 256, and we conducted training on 8 A100 GPUs.
Software Dependencies | No | The paper mentions using 'BLIP-2-OPT-2.7B', 'mPLUG-Owl-7B', 'LoRA [58]', and 'the transformers library from Hugging Face', but it does not provide specific version numbers for software libraries such as transformers or for the general programming environment (e.g., Python and PyTorch versions).
Experiment Setup | Yes | For LoRA, we set rank r = 8 and α = 32. The maximum sequence length of the LLM is set to 2048... For the Hierarchical Visual Feature Aggregation (HVFA) module, we employed a two-layer multi-head cross-attention module with d = 256 and 12 heads... we set λ = 0.1... we set the minimum coverage c_min to 30%... We trained our model with a learning rate of 1 × 10⁻⁴ for 10 epochs, incorporating a linear warmup of 50 steps, followed by cosine decay to 0. The total batch size was set to 256...
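
To make the reported configuration concrete, the sketch below assembles the stated hyperparameters (a two-layer multi-head cross-attention aggregator with d = 256 and 12 heads; learning rate 1e-4 with a 50-step linear warmup and cosine decay to 0; LoRA r = 8, α = 32) into a minimal PyTorch sketch. The module structure and names such as HVFACrossAttentionBlock, num_layers, and total_steps are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch built from the hyperparameters quoted above; the module layout
# and all names are assumptions, not the authors' code.
import math
import torch
import torch.nn as nn


class HVFACrossAttentionBlock(nn.Module):
    """One cross-attention layer: query tokens attend to visual features (assumed structure)."""

    def __init__(self, dim=256, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, visual_feats):
        out, _ = self.attn(queries, visual_feats, visual_feats)
        return self.norm(queries + out)


class HVFA(nn.Module):
    """Two-layer multi-head cross-attention aggregator (d = 256, 12 heads, as reported)."""

    def __init__(self, dim=256, num_heads=12, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [HVFACrossAttentionBlock(dim, num_heads) for _ in range(num_layers)]
        )

    def forward(self, queries, visual_feats):
        for layer in self.layers:
            queries = layer(queries, visual_feats)
        return queries


# Schedule as reported: lr 1e-4, 50-step linear warmup, cosine decay to 0.
# total_steps is a placeholder; the paper specifies 10 epochs, not a step count.
def lr_lambda(step, warmup_steps=50, total_steps=10_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = HVFA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# LoRA settings as reported (r = 8, alpha = 32), e.g. via the peft library
# (target modules would still need to be chosen; they are not specified here):
# from peft import LoraConfig
# lora_config = LoraConfig(r=8, lora_alpha=32)
```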