Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Authors: Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently... Experimental results demonstrate that the model trained with VisInContext delivers superior performance on common downstream benchmarks for in-context few-shot evaluation. |
| Researcher Affiliation | Collaboration | Alex Jinpeng Wang (1), Linjie Li (2), Yiqi Lin (1), Min Li (3), Lijuan Wang (2), and Mike Zheng Shou (1); 1: Show Lab, National University of Singapore, 2: Microsoft, 3: Central South University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/showlab/VisInContext. |
| Open Datasets | Yes | Our pretraining dataset includes a 180M subset of DataComp1B [26], MMC4 [13], the OBELICS [14] dataset, and OCR Rendered Text [27]. |
| Dataset Splits | Yes | We report answer accuracy on OK-VQA [28], TextVQA [29], VizWiz [30], and VQAv2 [31]. Additionally, we assess performance on captioning tasks using COCO [32] and Flickr30K [33]. Moreover, we also propose a setting named text-only in-context few-shots to explore text-only in-context evaluation. To illustrate the impact of having long in-context text, we evaluate the model for document understanding on DocVQA [15] and OCR-VQA [34]. |
| Hardware Specification | Yes | VisInContext significantly increases the in-context text length from 256 to 2048 during pretraining on NVIDIA H100 GPUs... the in-context text length can be extended up to 9192 tokens for the 56B MoE model on 80GB H100 GPUs with our method at the inference stage. |
| Software Dependencies | No | The paper mentions using DeepSpeed [3] Zero-2 stage with CPU offloading and quantizing to 4-bit precision, with a footnote pointing to "https://github.com/TimDettmers/bitsandbytes". However, specific version numbers for DeepSpeed, PyTorch, or bitsandbytes are not explicitly stated. (A minimal DeepSpeed configuration sketch follows the table.) |
| Experiment Setup | Yes | Pretraining. We validate VisInContext with OpenFlamingo [9] and CosMo [25]. To enhance computational efficiency, all models utilize float16 precision. For the 56B MoE [2] model, we employ DeepSpeed's [3] Zero-2 stage with CPU offloading and further optimize the model by quantizing it to 4-bit precision. We also use FlashAttention [17] to further improve memory efficiency. For all other experiments, we train the model using DeepSpeed Zero-2 without CPU offloading. The OpenFlamingo 9B baseline is based on Mistral-7B [5]. Our pretraining dataset includes a 180M subset of DataComp1B [26], MMC4 [13], the OBELICS [14] dataset, and OCR Rendered Text [27]. (More details are provided in Appendix B.1.) For each input document or image-text pair, we render a text sequence into an image with a fixed size of 16x8192 (512 patches) by default, with ph = pw = 16. Appendix B.2 provides Table 10 with detailed hyperparameters such as learning rate, batch size, max training steps, optimizer, etc. (A sketch of the text-rendering step follows the table.) |
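
The Experiment Setup row describes the core preprocessing step: each long text sequence is rendered into a fixed 16x8192 image and split into 512 patches of 16x16 pixels, which are then consumed as visual tokens instead of text tokens. Below is a minimal sketch of that step, assuming Pillow, NumPy, and PyTorch; the function names, font choice, and margins are illustrative assumptions and are not taken from the official VisInContext code.

```python
# Minimal sketch of the text-rendering step described in the Experiment Setup row.
# Assumes Pillow, NumPy, and PyTorch. Function names, font, and margins are
# illustrative; see https://github.com/showlab/VisInContext for the actual pipeline.
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont

IMG_H, IMG_W = 16, 8192      # fixed canvas size from the paper (yields 512 patches)
PATCH_H, PATCH_W = 16, 16    # ph = pw = 16

def render_text_as_image(text: str) -> Image.Image:
    """Draw a text sequence onto a fixed-size grayscale canvas."""
    canvas = Image.new("L", (IMG_W, IMG_H), color=255)               # white background
    draw = ImageDraw.Draw(canvas)
    draw.text((0, 2), text, fill=0, font=ImageFont.load_default())   # black text
    return canvas

def patchify(img: Image.Image) -> torch.Tensor:
    """Split the rendered image into non-overlapping ph x pw patches, flattened."""
    x = torch.from_numpy(np.array(img, dtype=np.float32)) / 255.0    # (16, 8192)
    patches = x.unfold(0, PATCH_H, PATCH_H).unfold(1, PATCH_W, PATCH_W)
    return patches.reshape(-1, PATCH_H * PATCH_W)                    # (512, 256)

tokens = patchify(render_text_as_image("A long in-context document rendered as pixels."))
print(tokens.shape)  # torch.Size([512, 256])
```

Each of the 512 patches becomes one visual token, so long textual context is fed through the vision pathway rather than the text tokenizer, which is what allows the in-context text length to grow during pretraining.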
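
For the Software Dependencies row: the paper names DeepSpeed ZeRO Stage 2 with CPU offloading and float16 training but gives no version numbers. The snippet below is a minimal sketch of such a configuration using the standard DeepSpeed Python API; the batch size, learning rate, and toy model are placeholders, not the paper's hyperparameters (those are listed in its Appendix B.2, Table 10).

```python
# Minimal sketch of a DeepSpeed ZeRO-2 setup with CPU offloading and fp16, as
# mentioned in the paper. All numeric values are placeholders, and the Linear
# layer stands in for the actual multimodal model.
import torch
import deepspeed

model = torch.nn.Linear(512, 512)   # stand-in for the OpenFlamingo/CosMo backbone

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                      # placeholder
    "fp16": {"enabled": True},                                # float16 training
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},   # placeholder lr
    "zero_optimization": {
        "stage": 2,                              # ZeRO Stage 2
        "offload_optimizer": {"device": "cpu"},  # CPU offloading (56B MoE setting)
    },
}

# Launch with the `deepspeed` CLI so the distributed environment is set up.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

The smaller runs described in the paper drop the `offload_optimizer` entry (ZeRO-2 without CPU offloading); the 4-bit quantization of the 56B MoE model is handled separately via bitsandbytes and is not shown here.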