Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Authors: Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently... Experimental results demonstrate that the model trained with VisInContext delivers superior performance on common downstream benchmarks for in-context few-shot evaluation.
Researcher Affiliation | Collaboration | Alex Jinpeng Wang (1), Linjie Li (2), Yiqi Lin (1), Min Li (3), Lijuan Wang (2), and Mike Zheng Shou (1); (1) Show Lab, National University of Singapore; (2) Microsoft; (3) Central South University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/showlab/VisInContext.
Open Datasets | Yes | Our pretraining dataset includes a 180M subset of DataComp1B [26], MMC4 [13], the OBELICS [14] dataset, and OCR Rendered Text [27].
Dataset Splits | Yes | We report answer accuracy on OK-VQA [28], TextVQA [29], VizWiz [30], and VQAv2 [31]. Additionally, we assess performance on the COCO [32] and Flickr30K [33] captioning tasks. We also propose a text-only in-context few-shot setting to explore text-only in-context evaluation. To illustrate the impact of long in-context text, we evaluate the model for document understanding on DocVQA [15] and OCR-VQA [34].
Hardware Specification | Yes | VisInContext significantly increases the in-context text length from 256 to 2048 tokens during pretraining on NVIDIA H100 GPUs... the in-context text length can be extended up to 9192 tokens for the 56B MoE model on 80GB H100 GPUs with our method at the inference stage.
Software Dependencies | No | The paper mentions using the DeepSpeed [3] ZeRO-2 stage with CPU offloading and quantizing to 4-bit precision, with a footnote pointing to https://github.com/TimDettmers/bitsandbytes. However, specific version numbers for DeepSpeed, PyTorch, or bitsandbytes are not explicitly stated. (Hedged DeepSpeed and 4-bit quantization sketches are given after the table.)
Experiment Setup | Yes | Pretraining. We validate VisInContext with OpenFlamingo [9] and CosMo [25]. To enhance computational efficiency, all models use float16 precision. For the 56B MoE [2] model, we employ DeepSpeed's [3] ZeRO-2 stage with CPU offloading and further optimize the model by quantizing it to 4-bit precision. We also use FlashAttention [17] to further improve memory efficiency. For all other experiments, we train the model using DeepSpeed ZeRO-2 without CPU offloading. The OpenFlamingo 9B baseline is based on Mistral-7B [5]. Our pretraining dataset includes a 180M subset of DataComp1B [26], MMC4 [13], the OBELICS [14] dataset, and OCR Rendered Text [27] (more details are provided in Appendix B.1). For each input document or image-text pair, we render a text sequence into an image with a fixed size of 16x8192 (512 patches) by default, with ph = pw = 16 (see the rendering sketch below). Appendix B.2 provides Table 10 with detailed hyperparameters such as learning rate, batch size, max training steps, optimizer, etc.
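
The Experiment Setup row states that each text sequence is rendered into a fixed-size 16x8192 image and cut into 16x16 patches, i.e. 8192 / 16 = 512 patches. Below is a minimal sketch of that rendering-and-patching step, assuming a grayscale canvas, PIL's default font, and left-aligned text; the paper's actual renderer may differ in these details.

```python
# Minimal sketch of rendering text onto a 16 x 8192 strip and splitting it into
# 16x16 patches (8192 / 16 = 512 patches). Font and layout choices are assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

PH = PW = 16       # patch height / width, as in the paper (ph = pw = 16)
WIDTH = 8192       # canvas width -> 8192 / 16 = 512 patches per strip

def render_text_to_patches(text: str) -> np.ndarray:
    canvas = Image.new("L", (WIDTH, PH), color=255)   # white 16 x 8192 strip
    ImageDraw.Draw(canvas).text((0, 2), text, fill=0, font=ImageFont.load_default())
    arr = np.asarray(canvas)                          # shape (16, 8192)
    # split the strip into 512 non-overlapping 16x16 patches
    return arr.reshape(PH, WIDTH // PW, PW).transpose(1, 0, 2)   # shape (512, 16, 16)

patches = render_text_to_patches("An in-context document rendered as visual tokens.")
print(patches.shape)   # (512, 16, 16)
```

In the paper's framing, these patches are then consumed through the visual pathway as visual tokens, which is how additional in-context text can be supplied without enlarging the text-token context.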
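
The DeepSpeed settings quoted above (ZeRO-2, CPU offloading for the 56B MoE run, float16 throughout) correspond to a fairly standard DeepSpeed configuration. The sketch below shows one plausible form of that configuration; the batch size and optimizer hyperparameters are placeholders, since the paper defers those values to Table 10 in Appendix B.2.

```python
# Hedged sketch: DeepSpeed ZeRO-2 with fp16 and CPU optimizer offloading.
# Hyperparameter values are placeholders, not taken from the paper.
# Launch with the DeepSpeed launcher, e.g.: deepspeed this_script.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                      # placeholder
    "fp16": {"enabled": True},                                # float16 precision
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},   # placeholder hyperparameters
    "zero_optimization": {
        "stage": 2,                                  # ZeRO-2: shard optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},      # CPU offloading (56B MoE setting only)
    },
}

model = torch.nn.Linear(512, 512)                             # stand-in for the actual MLLM
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

For the non-MoE runs, dropping the offload_optimizer entry would match the stated "ZeRO-2 without CPU offloading" setting.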
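
The 4-bit quantization of the 56B MoE model is attributed only to bitsandbytes via a footnote. One common route is the Hugging Face transformers integration; the snippet below is a sketch under that assumption, and the checkpoint name is a placeholder rather than the paper's actual model.

```python
# Hedged sketch: 4-bit weight quantization via the transformers + bitsandbytes
# integration. The checkpoint name is a placeholder; the paper does not specify one.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # assumption: NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16, matching the paper
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",          # placeholder for the 56B MoE backbone
    quantization_config=bnb_config,
    device_map="auto",
)
```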