Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
Authors: Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks. |
| Researcher Affiliation | Collaboration | Jaeyoo Park¹, Jin Young Choi², Jeonghyung Park³, Bohyung Han¹,² (¹ECE and ²IPAI, Seoul National University; ³Samsung SDS) |
| Pseudocode | Yes | Figure 2: Illustration of the Hierarchical Visual Feature Aggregation (HVFA) module. |
| Open Source Code | No | The paper mentions using specific external libraries and models but does not state that the code for its own methodology is open source or provide a link for it. |
| Open Datasets | Yes | We utilize document instruction dataset corpora for training, following [19]. The training dataset includes various types of document images, including tables, charts, natural images, and web page screenshots. The datasets used are DocVQA [45], InfographicsVQA [46], DeepForm [47], Kleister Charity [48], WikiTableQuestions [49], TabFact [50], ChartQA [51], VisualMRC [52], TextVQA [53], and TextCaps [54]. The combined dataset comprises approximately 650K image-instruction pairs. |
| Dataset Splits | Yes | For all of the benchmarks, we used the train/val/split set provided by UReader [19], which is basically built on DUE benchmark [59]. |
| Hardware Specification | Yes | The total batch size was set to 256, and we conducted training on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'BLIP-2-OPT-2.7B', 'mPLUG-Owl-7B', 'LoRA [58]', and 'the transformers library from huggingface' but does not provide specific version numbers for software libraries like transformers or the general programming environment (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | For LoRA, we set rank r = 8, and α = 32. The maximum sequence length of the LLM is set to 2048... For the Hierarchical Visual Feature Aggregation (HVFA) module, we employed a two-layer multihead cross-attention layer with d = 256 and 12 heads... we set λ = 0.1... we set the minimum coverage c_min to 30%... We trained our model with a learning rate of 1 × 10⁻⁴ for 10 epochs, incorporating a linear warmup of 50 steps, followed by cosine decay to 0. The total batch size was set to 256... |
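
The training schedule quoted in the Experiment Setup row (peak learning rate 1 × 10⁻⁴, 50 linear warmup steps, cosine decay to 0, 10 epochs, total batch size 256 on 8 GPUs) can be illustrated with a short sketch. This is not the authors' code. The function names, the exact warmup convention (starting from 0), the treatment of "approximately 650K" as exactly 650,000 samples, and the assumption of no gradient accumulation are all hypothetical choices made for illustration.

```python
import math

# Values quoted in the table above.
PEAK_LR = 1e-4
WARMUP_STEPS = 50
EPOCHS = 10
TOTAL_BATCH_SIZE = 256
NUM_GPUS = 8

# Assumption: "approximately 650K" pairs is treated as exactly 650,000.
DATASET_SIZE = 650_000


def per_gpu_batch_size(total=TOTAL_BATCH_SIZE, gpus=NUM_GPUS):
    """Per-device batch size, assuming no gradient accumulation."""
    return total // gpus


def total_training_steps(dataset_size=DATASET_SIZE,
                         batch_size=TOTAL_BATCH_SIZE,
                         epochs=EPOCHS):
    """Optimizer steps over all epochs, keeping the last partial batch."""
    return epochs * math.ceil(dataset_size / batch_size)


def lr_at(step, total_steps, peak=PEAK_LR, warmup=WARMUP_STEPS):
    """Linear warmup from 0 to `peak`, then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps = total_training_steps()
    print("per-GPU batch size:", per_gpu_batch_size())
    print("total steps:", steps)
    for s in (0, 25, 50, steps // 2, steps):
        print(s, lr_at(s, steps))
```

Under these assumptions the run covers roughly 25K optimizer steps, so the 50-step warmup is a very small fraction of training and the schedule is dominated by the cosine decay.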