ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce CONTEXTUAL, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap of 30.8% between GPT-4V (the current best-performing Large Multimodal Model) and human performance.
Researcher Affiliation | Academia | Rohan Wadhawan*, Hritik Bansal*, Kai-Wei Chang, Nanyun Peng; Department of Computer Science, University of California, Los Angeles, USA. Correspondence to: Rohan Wadhawan <rwadhawan7@g.ucla.edu>, Hritik Bansal <hbansal@g.ucla.edu>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/. [...] We make the dataset and code available to the LMM community along with a continuously updated leaderboard with recent LMMs. [...] GitHub Code Repository (footnote 3)
Open Datasets | Yes | In this paper, we introduce CONTEXTUAL, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. [...] Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/. [...] We make the dataset and code available to the LMM community along with a continuously updated leaderboard with recent LMMs. [...] Hugging Face Dataset (footnote 2) (see the dataset-loading sketch below the table)
Dataset Splits | Yes | To facilitate model development, we will release a subset of 100 samples from the 506 as a validation set, along with their reference responses, while keeping the reference responses hidden for the remaining 406 samples.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running its experiments.
Software Dependencies | Yes | We leverage the PP-OCRv4 model of the PaddleOCR library (paddlepaddle, 2023) for extracting OCR from the images, a LATIN-prompt (Wang et al., 2023a) inspired OCR text arrangement implementation to maintain layout-awareness in the OCR, and ShareGPT-4V-7B for the dense image captions (App. E). (see the OCR sketch below the table)
Experiment Setup | Yes | We conduct extensive experiments using CONTEXTUAL to assess the reasoning abilities of 14 foundation models over context-sensitive text-rich visual images (Section 3.1). This includes three augmented-LLM setups (e.g., GPT-4 (OpenAI, 2023a) prompted with combinations of image OCR, image layouts, and image captions), two proprietary LMMs (e.g., GPT-4V (OpenAI, 2023b), Gemini-Pro-Vision (Team et al., 2023)), and nine open LMMs (e.g., LLaVA-Next (Liu et al., 2024), LLaVA-1.5 (Liu et al., 2023a), ShareGPT-4V (Chen et al., 2023), Idefics (Hugging Face, 2023)). In addition, we perform few-shot experiments for a selected set of models (e.g., Gemini-Pro-Vision, Idefics) to analyze the effect of in-context examples on the model's performance. (see the prompting sketch below the table)
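
The Open Datasets row points to a Hugging Face dataset linked from the project page. The sketch below shows one way to pull the released validation subset; the repository id "ucla-contextual/contextual_val" and the field names are assumptions, so verify them on https://con-textual.github.io/ before use.

```python
# Minimal dataset-loading sketch (assumed Hugging Face repo id; check the project page).
from datasets import load_dataset

dataset = load_dataset("ucla-contextual/contextual_val")  # hypothetical repo id
print(dataset)  # shows the available splits and their sizes

# Inspect one sample; fields such as image, instruction, and reference response
# are expected but not guaranteed under these exact names.
first_split = list(dataset.keys())[0]
print(dataset[first_split][0].keys())
```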
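
For the Software Dependencies row: the paper extracts OCR with PaddleOCR's PP-OCRv4 model and arranges the text in a layout-aware way inspired by LATIN-prompt. The following is a simplified stand-in for that arrangement (grouping detected words into lines by vertical position), not the authors' implementation; the `ocr_version` argument assumes a recent PaddleOCR release.

```python
# Simplified layout-aware OCR sketch using PaddleOCR (PP-OCRv4).
from paddleocr import PaddleOCR

ocr = PaddleOCR(ocr_version="PP-OCRv4", use_angle_cls=True, lang="en")

def layout_aware_text(image_path, line_tol=10):
    """Return OCR text arranged roughly top-to-bottom, left-to-right."""
    result = ocr.ocr(image_path, cls=True)[0] or []
    # Each detection is [box, (text, confidence)]; box is four (x, y) corner points.
    words = []
    for box, (text, _conf) in result:
        x = min(p[0] for p in box)
        y = min(p[1] for p in box)
        words.append((y, x, text))
    words.sort()

    lines, current, last_y = [], [], None
    for y, x, text in words:
        # Start a new line when the vertical gap exceeds the tolerance.
        if last_y is not None and y - last_y > line_tol:
            lines.append(" ".join(t for _, t in sorted(current)))
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(" ".join(t for _, t in sorted(current)))
    return "\n".join(lines)

print(layout_aware_text("example_text_rich_image.png"))
```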
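
The Experiment Setup row describes augmented-LLM baselines in which a text-only GPT-4 receives the image OCR, layout, and caption in place of the image. A hedged sketch of that setup follows; the prompt template and the "gpt-4" model name are illustrative choices, not the paper's exact configuration.

```python
# Sketch of the augmented-LLM baseline: text-only GPT-4 answering from OCR + caption.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def augmented_llm_answer(instruction, ocr_text, caption, model="gpt-4"):
    # Illustrative prompt; the paper's actual template may differ.
    prompt = (
        "You are answering an instruction about an image you cannot see.\n\n"
        f"Dense image caption:\n{caption}\n\n"
        f"Layout-aware OCR of the image:\n{ocr_text}\n\n"
        f"Instruction: {instruction}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```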