Harmonizing Visual Text Comprehension and Generation

Authors: Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, Yuan Xie

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable performance to modality-specific fine-tuning results with only a 2% increase in parameters and shows an average improvement of 2.5% in visual text comprehension tasks and 4.0% in visual text generation tasks.
Researcher Affiliation | Collaboration | 1 East China Normal University, 2 ByteDance, 3 Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China. 51255901056@stu.ecnu.edu.cn, tangjingqun@bytedance.com, yxie@cs.ecnu.edu.cn
Pseudocode | No | The paper describes its model architecture and training process in detail but does not include any formal pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Code is available at https://github.com/bytedance/TextHarmony.
Open Datasets | Yes | TextHarmony is pre-trained based on the pre-training weight of MM-Interleaved [65], with extra text-rich datasets including MARIO-LAION [6] and DocStruct4M [21].
Dataset Splits | Yes | Datasets and Metrics. We evaluate TextHarmony on a broad range of vision-language tasks. Visual Text Comprehension includes Document-Oriented VQA (InfoVQA [43], DocVQA [44], ChartQA [42]), Table VQA (TabFact [8], WTQ [47]), Scene Text-Centric VQA (TextVQA [55], OCRVQA [45], STVQA [3]), and OCRBench [38].
Hardware Specification | Yes | The pre-training stage takes 3264 A100 hours with a batch size of 256, while the fine-tuning stage takes 2352 A100 hours with a batch size of 64.
Software Dependencies | No | The paper mentions models/frameworks like 'CLIP-ViT-L/14', 'Vicuna-13B', and 'Stable Diffusion v2.1' but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The image resolution is increased to 896 to capture fine-grained features better. A Q-Former with 12 blocks is adopted to reduce the number of visual tokens to 512. In the multi-modal pre-training stage, the initial learning rate is set to 1e-5, while in the fine-tuning stage it is reduced to 5e-6. The pre-training stage takes 3264 A100 hours with a batch size of 256, while the fine-tuning stage takes 2352 A100 hours with a batch size of 64.
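
To make the Experiment Setup row easier to reuse, here is a minimal sketch of the reported hyperparameters as a Python configuration. The class and field names are illustrative assumptions, not the authors' actual config schema; only the numeric values (896 resolution, 12 Q-Former blocks, 512 visual tokens, learning rates of 1e-5 and 5e-6, batch sizes of 256 and 64) come from the excerpt above.

```python
# Hypothetical two-stage training configuration reconstructed from the
# "Experiment Setup" row. Field names are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class TextHarmonyStageConfig:
    image_resolution: int = 896    # input resolution for fine-grained text features
    qformer_blocks: int = 12       # Q-Former depth used to compress visual features
    num_visual_tokens: int = 512   # visual tokens after Q-Former compression
    learning_rate: float = 1e-5    # initial learning rate (pre-training stage)
    batch_size: int = 256          # pre-training batch size


# Fine-tuning reuses the architecture settings but lowers the LR and batch size.
pretrain_cfg = TextHarmonyStageConfig()
finetune_cfg = TextHarmonyStageConfig(learning_rate=5e-6, batch_size=64)

print(pretrain_cfg)
print(finetune_cfg)
```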
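
The Research Type row cites a roughly 2% parameter increase from Slide-LoRA. As a sanity check, the sketch below computes the per-layer overhead of a generic low-rank adapter; the layer dimension and rank used here are illustrative assumptions, not values reported by the authors.

```python
# Back-of-the-envelope estimate of LoRA-style parameter overhead.
# The 5120-dim projection (typical of a 13B-class LLM) and rank 16 are
# hypothetical example values, not the paper's reported settings.
def lora_overhead(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of extra parameters a rank-`rank` adapter adds to one linear layer."""
    base = d_in * d_out            # frozen weight matrix
    extra = rank * (d_in + d_out)  # low-rank factors A (d_in x r) and B (r x d_out)
    return extra / base


print(f"{lora_overhead(5120, 5120, 16):.2%}")  # ~0.63% for a single adapted layer
```

Applied across many projections, and with more than one adapter per layer, totals on the order of a few percent of the base model are plausible, which is consistent with the reported ~2% figure.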