UNIT: Unifying Image and Text Recognition in One Vision Encoder

Authors: Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining the performance on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities.
Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab, 2 Hong Kong University of Science and Technology
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/yeezhu/UNIT
Open Datasets | Yes | Dataset Preparation. Let D be a curated dataset with 5M samples, where D = D^I_1 ∪ D^T_4. Here, D^I_1 represents a dataset of natural images annotated with coarse captions (less than 30 words). The images are resized to 1× the original scale r, primarily sourced from the Conceptual Captions dataset [46], comprising 3M samples. Meanwhile, D^T_4 represents a dataset of documents annotated with dense OCR data (more than 500 words). The document images are resized to 4× the original scale r, sourced from our synthetic dataset of English PDF documents, comprising 2M samples. (See the dataset-mixture sketch after this table.)
Dataset Splits | Yes | Dataset Preparation. Let D be a curated dataset with 5M samples, where D = D^I_1 ∪ D^T_4. Here, D^I_1 represents a dataset of natural images annotated with coarse captions (less than 30 words). The images are resized to 1× the original scale r, primarily sourced from the Conceptual Captions dataset [46], comprising 3M samples. Meanwhile, D^T_4 represents a dataset of documents annotated with dense OCR data (more than 500 words). The document images are resized to 4× the original scale r, sourced from our synthetic dataset of English PDF documents, comprising 2M samples. ... In the Supervised Finetuning (SFT) stage, we use the LLaVA-80k or LLaVA-CC665k along with the train sets of DocVQA [36] and ChartQA [34] as the fine-tuning dataset.
Hardware Specification | Yes | The training process requires 128 Ascend 910B GPUs (each with 64 GB of memory).
Software Dependencies | Yes | We implement our method using PyTorch 2.1.0 with CUDA 11.7.
Experiment Setup | Yes | Our optimization strategy uses AdamW [32] with a weight decay of 0.01. The initial learning rate is set to 5e-5 and decays with a cosine learning rate scheduler. The warmup ratio is set to 0.03, and the global batch size is 256. We set the loss weights λ = 2 and µ = 0.2. These settings are shared across both training stages. (See the optimizer sketch after this table.)
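
To make the quoted dataset-preparation description concrete, here is a minimal PyTorch sketch of the mixed corpus D = D^I_1 ∪ D^T_4: a roughly 3M-sample natural-image/caption subset at 1× resolution and a roughly 2M-sample document/OCR subset at 4× resolution. This is not the authors' code; the base resolution R, the file names, and the `load_records` helper are hypothetical placeholders under the assumptions stated in the comments.

```python
# Sketch only (assumed, not from the UNIT repository): building the two
# sub-corpora described in the Dataset Preparation excerpt.
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

R = 448  # assumed base input resolution r; the excerpt does not give its value

class ImageTextPairs(Dataset):
    """Generic (image, text) pairs; `records` is a list of (image_path, text)."""
    def __init__(self, records, scale):
        self.records = records
        side = R * scale  # 1x for natural images, 4x for document pages
        self.tf = transforms.Compose([
            transforms.Resize((side, side)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        path, text = self.records[idx]
        return self.tf(Image.open(path).convert("RGB")), text

# `load_records` is a hypothetical JSONL reader returning (path, text) pairs.
# D^I_1: ~3M natural images with coarse captions (<30 words), resized to 1x.
caption_set = ImageTextPairs(load_records("cc3m.jsonl"), scale=1)
# D^T_4: ~2M synthetic PDF pages with dense OCR (>500 words), resized to 4x.
document_set = ImageTextPairs(load_records("pdf_ocr.jsonl"), scale=4)

# Separate loaders per scale: the two subsets yield tensors of different
# sizes, so they cannot share the default collate function in one batch.
caption_loader = DataLoader(caption_set, batch_size=256, shuffle=True, num_workers=8)
document_loader = DataLoader(document_set, batch_size=256, shuffle=True, num_workers=8)
```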
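
Likewise, the Experiment Setup row translates directly into optimizer and scheduler configuration. The sketch below is an assumed PyTorch rendering of those hyperparameters (AdamW, weight decay 0.01, initial learning rate 5e-5, cosine decay with a 0.03 warmup ratio); `model` and `total_steps` are placeholders, and the global batch size of 256 would come from data parallelism across devices, which is omitted here.

```python
# Sketch only (assumed, not the authors' training script).
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1024, 1024)    # placeholder for the UNIT vision encoder
total_steps = 10_000                   # placeholder; depends on dataset size and batch 256
warmup_steps = int(0.03 * total_steps) # warmup ratio 0.03 from the excerpt

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

def lr_lambda(step):
    # Linear warmup followed by cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Loss weights quoted in the excerpt; the individual loss terms they scale
# are defined in the paper and are not reproduced here.
LAMBDA, MU = 2.0, 0.2
```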