Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

Authors: Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.
Researcher Affiliation | Collaboration | Jaeyoo Park (ECE, Seoul National University), Jin Young Choi (IPAI, Seoul National University), Jeonghyung Park (Samsung SDS), Bohyung Han (ECE & IPAI, Seoul National University). Contact: {bellos1203, jychoi999, bhhan}@snu.ac.kr, jeong.h.park@samsung.com
Pseudocode | Yes | Figure 2: Illustration of the Hierarchical Visual Feature Aggregation (HVFA) module.
Open Source Code | No | The paper mentions using specific external libraries and models but does not state that the code for its own method is released as open source, nor does it provide a link to it.
Open Datasets | Yes | We utilize document instruction dataset corpora for training, following [19]. The training data covers various types of document images, including tables, charts, natural images, and web page screenshots. The datasets used are DocVQA [45], InfographicsVQA [46], DeepForm [47], Kleister Charity [48], WikiTableQuestions [49], TabFact [50], ChartQA [51], VisualMRC [52], TextVQA [53], and TextCaps [54]. The combined dataset comprises approximately 650K image-instruction pairs.
Dataset Splits | Yes | For all of the benchmarks, we used the train/val/test splits provided by UReader [19], which are built on the DUE benchmark [59].
Hardware Specification | Yes | The total batch size was set to 256, and we conducted training on 8 A100 GPUs.
Software Dependencies | No | The paper mentions using 'BLIP-2-OPT-2.7B', 'mPLUG-Owl-7B', 'LoRA [58]', and 'the transformers library from Hugging Face', but it does not provide specific version numbers for software libraries such as transformers or for the general programming environment (e.g., Python and PyTorch versions).
Experiment Setup | Yes | For LoRA, we set rank r = 8 and α = 32. The maximum sequence length of the LLM is set to 2048... For the Hierarchical Visual Feature Aggregation (HVFA) module, we employed a two-layer multi-head cross-attention module with d = 256 and 12 heads... we set λ = 0.1... we set the minimum coverage c_min to 30%... We trained our model with a learning rate of 1 × 10⁻⁴ for 10 epochs, incorporating a linear warmup of 50 steps, followed by cosine decay to 0. The total batch size was set to 256...
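
To make the reported configuration concrete, the sketch below assembles the stated hyperparameters (a two-layer multi-head cross-attention aggregator with d = 256 and 12 heads; learning rate 1e-4 with a 50-step linear warmup and cosine decay to 0; LoRA r = 8, α = 32) into a minimal PyTorch sketch. The module structure and names such as HVFACrossAttentionBlock, num_layers, and total_steps are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch built from the hyperparameters quoted above; the module layout
# and all names are assumptions, not the authors' code.
import math
import torch
import torch.nn as nn


class HVFACrossAttentionBlock(nn.Module):
    """One cross-attention layer: query tokens attend to visual features (assumed structure)."""

    def __init__(self, dim=256, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, visual_feats):
        out, _ = self.attn(queries, visual_feats, visual_feats)
        return self.norm(queries + out)


class HVFA(nn.Module):
    """Two-layer multi-head cross-attention aggregator (d = 256, 12 heads, as reported)."""

    def __init__(self, dim=256, num_heads=12, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [HVFACrossAttentionBlock(dim, num_heads) for _ in range(num_layers)]
        )

    def forward(self, queries, visual_feats):
        for layer in self.layers:
            queries = layer(queries, visual_feats)
        return queries


# Schedule as reported: lr 1e-4, 50-step linear warmup, cosine decay to 0.
# total_steps is a placeholder; the paper specifies 10 epochs, not a step count.
def lr_lambda(step, warmup_steps=50, total_steps=10_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = HVFA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# LoRA settings as reported (r = 8, alpha = 32), e.g. via the peft library
# (target modules would still need to be chosen; they are not specified here):
# from peft import LoraConfig
# lora_config = LoraConfig(r=8, lora_alpha=32)
```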