ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce CONTEXTUAL, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (GPT-4V, Gemini-Pro Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap of 30.8% between GPT-4V (the current best-performing Large Multimodal Model) and human performance. |
| Researcher Affiliation | Academia | Rohan Wadhawan*, Hritik Bansal*, Kai-Wei Chang, Nanyun Peng. Department of Computer Science, University of California Los Angeles, USA. Correspondence to: Rohan Wadhawan <rwadhawan7@g.ucla.edu>, Hritik Bansal <hbansal@g.ucla.edu>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/. [...] We make the dataset2 and code3 available to the LMM community along with a continuously updated leaderboard4 with recent LMMs. [...] 3 GitHub Code Repository |
| Open Datasets | Yes | In this paper, we introduce CONTEXTUAL, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. [...] Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/. [...] We make the dataset2 and code3 available to the LMM community along with a continuously updated leaderboard4 with recent LMMs. [...] 2 Hugging Face Dataset |
| Dataset Splits | Yes | To facilitate model development, we will release a subset of 100 samples from the 506 as a validation set, along with their reference responses, while keeping them hidden for the remaining 406 samples. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running its experiments. |
| Software Dependencies | Yes | We leverage the PP-OCRv4 model of the PaddleOCR library (PaddlePaddle, 2023) for extracting OCR from the images, a LATIN-prompt-inspired (Wang et al., 2023a) OCR text arrangement implementation to maintain layout-awareness in the OCR, and ShareGPT-4V-7B for the dense image captions (App. E). (See the OCR sketch following this table.) |
| Experiment Setup | Yes | We conduct extensive experiments using CONTEXTUAL to assess the reasoning abilities of 14 foundation models over context-sensitive text-rich visual images (Section 3.1). This includes three augmented LLM setups (e.g., GPT-4 (OpenAI, 2023a) prompted with combinations of image OCR, image layouts, and image captions), two proprietary LMMs (e.g., GPT-4V (OpenAI, 2023b), Gemini-Pro-Vision (Team et al., 2023)), and nine open LMMs (e.g., LLaVA-Next (Liu et al., 2024), LLaVA-1.5 (Liu et al., 2023a), ShareGPT-4V (Chen et al., 2023), Idefics (Hugging Face, 2023)). In addition, we perform few-shot experiments for a selected set of models (e.g., Gemini-Pro-Vision, Idefics) to analyze the effect of in-context examples on the model's performance. |
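
The Software Dependencies row describes a concrete preprocessing pipeline: run PaddleOCR's PP-OCRv4 detector/recognizer over each image, then arrange the recognized text so the prompt roughly preserves the image layout, in the spirit of the LATIN-prompt arrangement. The sketch below illustrates that idea only; it is not the authors' implementation, and the row-grouping heuristic (`row_tolerance`) and the example image path are hypothetical assumptions.

```python
# Minimal sketch of layout-aware OCR extraction with PaddleOCR.
# Assumptions: PP-OCRv4 is the default model in recent PaddleOCR releases,
# and the simple row-grouping heuristic below stands in for the paper's
# LATIN-prompt-inspired arrangement.
from paddleocr import PaddleOCR  # pip install paddleocr paddlepaddle


def layout_aware_ocr(image_path: str, row_tolerance: float = 10.0) -> str:
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr(image_path, cls=True)
    if not result or result[0] is None:
        return ""

    # Each detection is [bounding_box, (text, confidence)].
    words = []
    for box, (text, _conf) in result[0]:
        x = min(pt[0] for pt in box)
        y = min(pt[1] for pt in box)
        words.append((y, x, text))

    # Group boxes into rows by vertical position, then read each row left to right.
    words.sort(key=lambda w: (w[0], w[1]))
    rows, current, last_y = [], [], None
    for y, x, text in words:
        if last_y is not None and abs(y - last_y) > row_tolerance:
            rows.append(current)
            current = []
        current.append((x, text))
        last_y = y
    if current:
        rows.append(current)

    return "\n".join(" ".join(t for _, t in sorted(row)) for row in rows)


if __name__ == "__main__":
    # Hypothetical example image; replace with a text-rich image from the dataset.
    print(layout_aware_ocr("sample_text_rich_image.png"))
```

The resulting text block can then be placed into an augmented-LLM prompt alongside a dense caption, which is how the paper describes feeding text-only models such as GPT-4 with image information.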