UNIT: Unifying Image and Text Recognition in One Vision Encoder
Authors: Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining performance on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities. |
| Researcher Affiliation | Collaboration | 1Huawei Noah's Ark Lab, 2Hong Kong University of Science and Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/yeezhu/UNIT |
| Open Datasets | Yes | Dataset Preparation. Let D be a curated dataset with 5M samples, where D = D^I_1 ∪ D^T_4. Here, D^I_1 represents a dataset of natural images annotated with coarse captions (fewer than 30 words). The images are resized to 1× the original scale r, primarily sourced from the Conceptual Captions dataset [46], comprising 3M samples. Meanwhile, D^T_4 represents a dataset of documents annotated with dense OCR data (more than 500 words). The document images are resized to 4× the original scale r, sourced from our synthetic dataset of English PDF documents, comprising 2M samples. (See the dataset-mixing sketch after the table.) |
| Dataset Splits | Yes | Dataset Preparation. Let D be a curated dataset with 5M samples, where D = D^I_1 ∪ D^T_4. Here, D^I_1 represents a dataset of natural images annotated with coarse captions (fewer than 30 words). The images are resized to 1× the original scale r, primarily sourced from the Conceptual Captions dataset [46], comprising 3M samples. Meanwhile, D^T_4 represents a dataset of documents annotated with dense OCR data (more than 500 words). The document images are resized to 4× the original scale r, sourced from our synthetic dataset of English PDF documents, comprising 2M samples. ... In the Supervised Finetuning (SFT) stage, we use the LLaVA-80k or LLaVA-CC665k along with the train sets of DocVQA [36] and ChartQA [34] as the fine-tuning dataset. |
| Hardware Specification | Yes | The training process requires 128 Ascend 910B GPUs (each with 64 GB of memory). |
| Software Dependencies | Yes | We implement our method using PyTorch 2.1.0 with CUDA 11.7. |
| Experiment Setup | Yes | Our optimization strategy uses AdamW [32] with a weight decay of 0.01. The initial learning rate is set to 5e-5 and decays with a cosine learning rate scheduler. The warmup ratio is set to 0.03, and the global batch size is 256. We set loss weights λ = 2 and µ = 0.2. These settings are shared across both training stages. (See the optimizer sketch after the table.) |
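
The dataset-preparation excerpt above describes a 5M-sample mixture: 3M natural image–caption pairs resized to 1× a base resolution r (Conceptual Captions) and 2M synthetic document pages with dense OCR text resized to 4× that resolution. The sketch below is a minimal, hypothetical reconstruction of that mixing step, not the authors' released code; names such as `ScaledImageTextSet`, `base_resolution`, and the placeholder record lists are assumptions.

```python
from dataclasses import dataclass
from PIL import Image
from torch.utils.data import ConcatDataset, Dataset
from torchvision import transforms


@dataclass
class Sample:
    image: Image.Image
    text: str   # coarse caption (<30 words) or dense OCR text (>500 words)
    scale: int  # 1 for natural images, 4 for document images


class ScaledImageTextSet(Dataset):
    """Wraps (image_path, text) records and resizes images to scale * base resolution."""

    def __init__(self, records, scale, base_resolution=224):
        self.records = records                        # list of (image_path, text) tuples
        self.scale = scale
        side = scale * base_resolution                # e.g. 224 for natural images, 896 for documents
        self.resize = transforms.Resize((side, side))

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        path, text = self.records[idx]
        image = self.resize(Image.open(path).convert("RGB"))
        return Sample(image=image, text=text, scale=self.scale)


# Placeholder records standing in for ~3M Conceptual Captions pairs (D^I_1)
# and ~2M synthetic English PDF pages with dense OCR (D^T_4).
natural_records = [("cat.jpg", "a cat sitting on a sofa")]
document_records = [("page_0001.png", "dense OCR transcription of a PDF page ...")]

natural_set = ScaledImageTextSet(natural_records, scale=1)
document_set = ScaledImageTextSet(document_records, scale=4)
mixed_set = ConcatDataset([natural_set, document_set])   # D = D^I_1 ∪ D^T_4
```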
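
The experiment-setup excerpt quotes AdamW with weight decay 0.01, an initial learning rate of 5e-5 with cosine decay, a warmup ratio of 0.03, a global batch size of 256, and loss weights λ = 2 and µ = 0.2. The sketch below shows one plausible way to wire these hyperparameters together in PyTorch; the model, the step count, and the two losses that λ and µ scale are hypothetical, since the excerpt does not specify them.

```python
import math

import torch

# Stand-ins for the UNIT vision encoder and its training data (hypothetical).
model = torch.nn.Linear(768, 768)
total_steps = 1_000                      # hypothetical; in practice set by the 5M samples and batch size 256
warmup_steps = int(0.03 * total_steps)   # warmup ratio 0.03
lambda_w, mu_w = 2.0, 0.2                # loss weights lambda = 2, mu = 0.2

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)


def lr_lambda(step):
    """Linear warmup for the first 3% of steps, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)


def compute_losses(model, batch):
    """Hypothetical stand-in for the two training losses weighted by lambda and mu."""
    out = model(batch)
    return out.pow(2).mean(), out.abs().mean()


for step in range(total_steps):
    batch = torch.randn(4, 768)                      # placeholder for a global batch of 256
    text_loss, image_loss = compute_losses(model, batch)
    loss = lambda_w * text_loss + mu_w * image_loss  # schematic; the excerpt does not name the two losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

An equivalent warmup-plus-cosine schedule could also be built with a library helper such as Hugging Face's `get_cosine_schedule_with_warmup`.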