Harmonizing Visual Text Comprehension and Generation
Authors: Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, Yuan Xie
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable performance to modality-specific fine-tuning results with only a 2% increase in parameters and shows an average improvement of 2.5% in visual text comprehension tasks and 4.0% in visual text generation tasks. (A parameter-count sketch illustrating the 2% figure follows this table.) |
| Researcher Affiliation | Collaboration | ¹East China Normal University, ²ByteDance, ³Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China. 51255901056@stu.ecnu.edu.cn, tangjingqun@bytedance.com, yxie@cs.ecnu.edu.cn |
| Pseudocode | No | The paper describes its model architecture and training process in detail but does not include any formal pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Code is available at https://github.com/bytedance/TextHarmony. |
| Open Datasets | Yes | TextHarmony is pre-trained based on the pre-training weight of MM-Interleaved [65], with extra text-rich datasets including MARIO-LAION [6] and DocStruct4M [21]. |
| Dataset Splits | Yes | Datasets and Metrics. We evaluate TextHarmony on a broad range of vision-language tasks. Visual Text Comprehension includes Document-Oriented VQA (InfoVQA [43], DocVQA [44], ChartQA [42]), Table VQA (TabFact [8], WTQ [47]), Scene Text-Centric VQA (TextVQA [55], OCRVQA [45], STVQA [3]), and OCRBench [38]. |
| Hardware Specification | Yes | The pre-training stage takes 3264 A100 GPU-hours with a batch size of 256, while the fine-tuning stage takes 2352 A100 GPU-hours with a batch size of 64. |
| Software Dependencies | No | The paper mentions models/frameworks like 'CLIP-ViT-L/14', 'Vicuna-13B', and 'Stable Diffusion v2.1' but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The image resolution is increased to 896 to capture fine-grained features better. A Q-Former with 12 blocks is adopted to reduce the number of visual tokens to 512. In the multi-modal pre-training stage, the initial learning rate is set to 1e-5, while in the fine-tuning stage it is reduced to 5e-6. The pre-training stage takes 3264 A100 GPU-hours with a batch size of 256, while the fine-tuning stage takes 2352 A100 GPU-hours with a batch size of 64. |
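
The Experiment Setup row above lists the reported hyperparameters in prose. The sketch below consolidates them into a single configuration object; the class and field names are illustrative assumptions, not identifiers from the released TextHarmony code.

```python
# Hypothetical configuration consolidating the hyperparameters reported in the
# paper; class and field names are illustrative, not from the released code.
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    learning_rate: float  # initial learning rate for the stage
    batch_size: int       # global batch size
    a100_gpu_hours: int   # reported compute budget


@dataclass(frozen=True)
class TextHarmonyConfig:
    image_resolution: int = 896   # raised to capture fine-grained text features
    qformer_blocks: int = 12      # Q-Former depth
    num_visual_tokens: int = 512  # visual tokens after Q-Former compression
    pretrain: StageConfig = StageConfig(1e-5, 256, 3264)
    finetune: StageConfig = StageConfig(5e-6, 64, 2352)


config = TextHarmonyConfig()
print(config.pretrain.learning_rate)  # 1e-05
```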
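
The Research Type row cites a roughly 2% parameter increase from Slide-LoRA. A standard low-rank adapter makes that order of magnitude plausible: for a frozen d_out × d_in weight, LoRA trains factors of rank r with r(d_in + d_out) parameters. The dimensions and rank below are illustrative assumptions, not values reported for Slide-LoRA.

```python
# Back-of-the-envelope LoRA overhead; dimensions and rank are illustrative
# assumptions, not values reported for Slide-LoRA.

def lora_overhead(d_in: int, d_out: int, rank: int) -> float:
    """LoRA parameter count as a fraction of the frozen base weight (d_out x d_in)."""
    base = d_in * d_out              # frozen weight matrix W
    adapter = rank * (d_in + d_out)  # trainable factors A (rank x d_in), B (d_out x rank)
    return adapter / base

# A 5120 x 5120 projection with rank-64 adapters adds ~2.5% trainable
# parameters, the same order as the paper's reported ~2% increase.
print(f"{lora_overhead(5120, 5120, 64):.1%}")  # 2.5%
```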