ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce CONTEXTUAL, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. We conduct experiments to assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision, LLaVA-Next) and establish a human performance baseline. Further, we perform human evaluations of the model responses and observe a significant performance gap of 30.8% between GPT-4V (the current best-performing Large Multimodal Model) and human performance.
Researcher Affiliation | Academia | Rohan Wadhawan*, Hritik Bansal*, Kai-Wei Chang, Nanyun Peng; Department of Computer Science, University of California, Los Angeles, USA. Correspondence to: Rohan Wadhawan <rwadhawan7@g.ucla.edu>, Hritik Bansal <hbansal@g.ucla.edu>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/. [...] We make the dataset and code available to the LMM community along with a continuously updated leaderboard with recent LMMs. [...] GitHub Code Repository (footnote 3)
Open Datasets | Yes | In this paper, we introduce CONTEXTUAL, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images. [...] Our dataset, code, and leaderboard can be found on the project page https://con-textual.github.io/. [...] We make the dataset and code available to the LMM community along with a continuously updated leaderboard with recent LMMs. [...] Hugging Face Dataset (footnote 2) (see the dataset-loading sketch below the table)
Dataset Splits | Yes | To facilitate model development, we will release a subset of 100 samples from the 506 as a validation set, along with their reference responses, while keeping the reference responses hidden for the remaining 406 samples.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running its experiments.
Software Dependencies | Yes | We leverage the PP-OCRv4 model of the PaddleOCR library (paddlepaddle, 2023) for extracting OCR from the images, a LATIN-prompt (Wang et al., 2023a) inspired OCR text arrangement implementation to maintain layout-awareness in the OCR, and ShareGPT-4V-7B for the dense image captions (App. E). (see the OCR sketch below the table)
Experiment Setup | Yes | We conduct extensive experiments using CONTEXTUAL to assess the reasoning abilities of 14 foundation models over context-sensitive text-rich visual images (Section 3.1). This includes three augmented-LLM setups (e.g., GPT-4 (OpenAI, 2023a) prompted with combinations of image OCR, image layouts, and image captions), two proprietary LMMs (e.g., GPT-4V (OpenAI, 2023b), Gemini-Pro-Vision (Team et al., 2023)), and nine open LMMs (e.g., LLaVA-Next (Liu et al., 2024), LLaVA-1.5 (Liu et al., 2023a), ShareGPT-4V (Chen et al., 2023), Idefics (Hugging Face, 2023)). In addition, we perform few-shot experiments for a selected set of models (e.g., Gemini-Pro-Vision, Idefics) to analyze the effect of in-context examples on the model's performance. (see the prompting sketch below the table)
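
The Open Datasets row points to a Hugging Face dataset linked from the project page. The sketch below shows one way to pull the released validation subset; the repository id "ucla-contextual/contextual_val" and the field names are assumptions, so verify them on https://con-textual.github.io/ before use.

```python
# Minimal dataset-loading sketch (assumed Hugging Face repo id; check the project page).
from datasets import load_dataset

dataset = load_dataset("ucla-contextual/contextual_val")  # hypothetical repo id
print(dataset)  # shows the available splits and their sizes

# Inspect one sample; fields such as image, instruction, and reference response
# are expected but not guaranteed under these exact names.
first_split = list(dataset.keys())[0]
print(dataset[first_split][0].keys())
```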
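
For the Software Dependencies row: the paper extracts OCR with PaddleOCR's PP-OCRv4 model and arranges the text in a layout-aware way inspired by LATIN-prompt. The following is a simplified stand-in for that arrangement (grouping detected words into lines by vertical position), not the authors' implementation; the `ocr_version` argument assumes a recent PaddleOCR release.

```python
# Simplified layout-aware OCR sketch using PaddleOCR (PP-OCRv4).
from paddleocr import PaddleOCR

ocr = PaddleOCR(ocr_version="PP-OCRv4", use_angle_cls=True, lang="en")

def layout_aware_text(image_path, line_tol=10):
    """Return OCR text arranged roughly top-to-bottom, left-to-right."""
    result = ocr.ocr(image_path, cls=True)[0] or []
    # Each detection is [box, (text, confidence)]; box is four (x, y) corner points.
    words = []
    for box, (text, _conf) in result:
        x = min(p[0] for p in box)
        y = min(p[1] for p in box)
        words.append((y, x, text))
    words.sort()

    lines, current, last_y = [], [], None
    for y, x, text in words:
        # Start a new line when the vertical gap exceeds the tolerance.
        if last_y is not None and y - last_y > line_tol:
            lines.append(" ".join(t for _, t in sorted(current)))
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(" ".join(t for _, t in sorted(current)))
    return "\n".join(lines)

print(layout_aware_text("example_text_rich_image.png"))
```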
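
The Experiment Setup row describes augmented-LLM baselines in which a text-only GPT-4 receives the image OCR, layout, and caption in place of the image. A hedged sketch of that setup follows; the prompt template and the "gpt-4" model name are illustrative choices, not the paper's exact configuration.

```python
# Sketch of the augmented-LLM baseline: text-only GPT-4 answering from OCR + caption.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def augmented_llm_answer(instruction, ocr_text, caption, model="gpt-4"):
    # Illustrative prompt; the paper's actual template may differ.
    prompt = (
        "You are answering an instruction about an image you cannot see.\n\n"
        f"Dense image caption:\n{caption}\n\n"
        f"Layout-aware OCR of the image:\n{ocr_text}\n\n"
        f"Instruction: {instruction}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```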