Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
UNIT: Unifying Image and Text Recognition in One Vision Encoder
Authors: Yi Zhu, Zhou Yanpeng, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and Doc QA) while maintaining the performances on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities. |
| Researcher Affiliation | Collaboration | ¹Huawei Noah's Ark Lab, ²Hong Kong University of Science and Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/yeezhu/UNIT. |
| Open Datasets | Yes | Dataset Preparation. Let $D$ be a curated dataset with 5M samples, where $D = D_I^{1} \cup D_T^{4}$. Here, $D_I^{1}$ represents a dataset of natural images annotated with coarse captions (less than 30 words). The images are resized to 1 times the original scale $r$, primarily sourced from the Conceptual Caption dataset [46], comprising 3M samples. Meanwhile, $D_T^{4}$ represents a dataset of documents annotated with dense OCR data (more than 500 words). The document images are resized 4 times the original scale $r$, sourced from our synthetic dataset of English PDF documents, comprising 2M samples. |
| Dataset Splits | Yes | Dataset Preparation. Let $D$ be a curated dataset with 5M samples, where $D = D_I^{1} \cup D_T^{4}$. Here, $D_I^{1}$ represents a dataset of natural images annotated with coarse captions (less than 30 words). The images are resized to 1 times the original scale $r$, primarily sourced from the Conceptual Caption dataset [46], comprising 3M samples. Meanwhile, $D_T^{4}$ represents a dataset of documents annotated with dense OCR data (more than 500 words). The document images are resized 4 times the original scale $r$, sourced from our synthetic dataset of English PDF documents, comprising 2M samples. ... In the Supervised Finetuning (SFT) stage, we use the LLaVA-80k or LLaVA-CC665k along with the train set of DocVQA [36] and ChartQA [34] as the fine-tuning dataset. |
| Hardware Specification | Yes | The training process requires 128 Ascend 910B GPUs (each with 64 GB of memory). |
| Software Dependencies | Yes | We implement our method using PyTorch 2.1.0 with CUDA 11.7. |
| Experiment Setup | Yes | Our optimization strategy involves AdamW [32] with a weight decay 0.01. The initial learning rate to 5e-5, and changes with a cosine learning rate decay scheduler. The warmup ratio is set to 0.03, and the global batch size is 256. We set loss weights λ = 2 and µ = 0.2. These settings are shared for both two training stages. |
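
The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' training code: the quoted text gives the peak rate (5e-5), the warmup ratio (0.03), and a cosine decay, but not the warmup shape, the final learning rate, or the number of training steps. The sketch assumes linear warmup and decay to zero, and `lr_at_step` and `total_steps` are hypothetical names introduced here.

```python
import math

# Values quoted in the Experiment Setup row.
BASE_LR = 5e-5        # initial (peak) learning rate
WARMUP_RATIO = 0.03   # fraction of training spent in warmup
WEIGHT_DECAY = 0.01   # AdamW weight decay (listed for reference, unused below)
GLOBAL_BATCH = 256    # global batch size (listed for reference, unused below)


def lr_at_step(step, total_steps, base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Return the learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the quoted text): warmup is linear,
    and the cosine decay ends at a learning rate of zero.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, from base_lr down to zero.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For a hypothetical run of 10,000 steps, `lr_at_step(299, 10_000)` is the peak rate at the end of warmup, and the rate then falls along a half cosine toward zero at the final step.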