Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
Authors: Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks. |
| Researcher Affiliation | Collaboration | Jaeyoo Park¹, Jin Young Choi², Jeonghyung Park³, Bohyung Han¹·² (¹ECE & ²IPAI, Seoul National University; ³Samsung SDS). Contact: {bellos1203, jychoi999, bhhan}@snu.ac.kr, jeong.h.park@samsung.com |
| Pseudocode | Yes | Figure 2: Illustration of the Hierarchical Visual Feature Aggregation (HVFA) module. |
| Open Source Code | No | The paper mentions using specific external libraries and models but does not state that the code for its own methodology is open source or provide a link for it. |
| Open Datasets | Yes | We utilize document instruction dataset corpora for training, following [19]. The training dataset includes various types of document images, including tables, charts, natural images, and web page screenshots. The datasets used are DocVQA [45], InfographicsVQA [46], DeepForm [47], Kleister Charity [48], WikiTableQuestions [49], TabFact [50], ChartQA [51], VisualMRC [52], TextVQA [53], and TextCaps [54]. The combined dataset comprises approximately 650K image-instruction pairs. |
| Dataset Splits | Yes | For all of the benchmarks, we used the train/val/test splits provided by UReader [19], which are built on the DUE benchmark [59]. |
| Hardware Specification | Yes | The total batch size was set to 256, and we conducted training on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'BLIP-2-OPT-2.7B', 'mPLUG-Owl-7B', 'LoRA [58]', and 'the transformers library from Hugging Face', but does not provide version numbers for these libraries or for the general programming environment (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | For LoRA, we set rank r = 8 and α = 32. The maximum sequence length of the LLM is set to 2048... For the Hierarchical Visual Feature Aggregation (HVFA) module, we employed a two-layer multi-head cross-attention layer with d = 256 and 12 heads... we set λ = 0.1... we set the minimum coverage c_min to 30%... We trained our model with a learning rate of 1 × 10⁻⁴ for 10 epochs, incorporating a linear warmup of 50 steps, followed by cosine decay to 0. The total batch size was set to 256... (Hedged code sketches based on this setup follow the table.) |
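The experiment-setup row pins down the HVFA module's shape: two multi-head cross-attention layers with hidden size d = 256. Below is a minimal PyTorch sketch of such a module, not the authors' implementation. The pre-norm residual layout, the FFN, and the reading of queries as coarse tokens attending to finer-scale features are assumptions; the head count is reduced to 8 because `nn.MultiheadAttention` requires `embed_dim` to be divisible by `num_heads` (256 is not divisible by the reported 12).

```python
import torch
import torch.nn as nn


class HVFABlock(nn.Module):
    """One pre-norm cross-attention block: query tokens attend to visual features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, queries: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # Cross-attention: queries (coarse/global tokens) read from the
        # finer-scale feature sequence, then a residual FFN refines them.
        kv = self.norm_kv(feats)
        attn_out, _ = self.attn(self.norm_q(queries), kv, kv)
        x = queries + attn_out
        return x + self.ffn(self.norm_ffn(x))


class HVFA(nn.Module):
    """Two stacked cross-attention blocks, matching the reported depth."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            HVFABlock(dim, num_heads) for _ in range(num_layers)
        )

    def forward(self, queries: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            queries = block(queries, feats)
        return queries


# Usage: 64 aggregated tokens summarize 4 crops x 256 patch tokens each
# (the crop/token counts here are illustrative, not from the paper).
module = HVFA()
out = module(torch.randn(2, 64, 256), torch.randn(2, 4 * 256, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```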
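The training recipe also maps directly onto standard Hugging Face tooling. The sketch below wires the reported LoRA rank/alpha, warmup, and cosine-decay schedule together; the optimizer choice (AdamW), the LoRA target modules, and loading OPT-2.7B via `AutoModelForCausalLM` are illustrative assumptions, not details confirmed by the paper.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

# Hyperparameters quoted in the experiment-setup row above.
LR = 1e-4               # learning rate
EPOCHS = 10
WARMUP_STEPS = 50       # linear warmup, then cosine decay to 0
BATCH_SIZE = 256        # total batch size across 8 A100 GPUs
DATASET_SIZE = 650_000  # ~650K image-instruction pairs
steps_per_epoch = DATASET_SIZE // BATCH_SIZE

# LoRA rank/alpha come from the paper; the target modules are an
# assumption, since the paper does not list them.
lora_cfg = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# Illustrative base LLM (the paper builds on BLIP-2-OPT-2.7B, among others).
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b"), lora_cfg
)

# AdamW is an assumption; the paper only states the learning-rate schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=EPOCHS * steps_per_epoch,
)
```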