Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective
Authors: Xinmiao Yu, Xiaocheng Feng, Yun Li, Minghui Liao, Ya-Qi Yu, Xiachong Feng, Weihong Zhong, Ruihan Chen, Mengkang Hu, Jihao Wu, Duyu Tang, Dandan Tu, Bing Qin
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance for cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), where a visual-text cross-lingual alignment is built by maximizing mutual information between the model's outputs and visual information. This is achieved by distilling knowledge from monolingual to cross-lingual settings through KL divergence minimization, where monolingual output logits serve as a teacher. Experimental results on XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, shedding new light on potential practices for improving LVLMs. |
| Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology, 2 Huawei Inc., 3 The University of Hong Kong |
| Pseudocode | No | The paper only describes methodologies and processes using natural language and a diagram (Figure 2: The XPaper QA dataset construction pipeline) without any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper discusses open-source third-party LVLMs (e.g., LLaVA, mPLUG-Owl2, Qwen-VL-Chat, Monkey, CogVLM, MiniCPM-V) used for benchmarking, but it does not explicitly state the release of the source code for their own method (MVCL-MI) or provide a link to a repository for it. |
| Open Datasets | Yes | To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaper QA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. [...] XT-VQA integrates multiple existing VQA datasets (Mathew, Karatzas, and Jawahar 2021; Masry et al. 2022; Singh et al. 2019; Mishra et al. 2019) and introduces the newly curated XPaper QA dataset, which focuses on bilingual academic papers. [...] For English papers, we reconstruct the QASPER dataset (Dasigi et al. 2021), which contains 5,049 questions across 1,585 Natural Language Processing papers... |
| Dataset Splits | No | The paper introduces a new benchmark, XT-VQA, and a newly collected dataset, XPaper QA, but it does not specify training, validation, and test splits for these datasets, nor does it detail how the existing datasets were split after multilingual extension. A random sample of 100 examples from ChartQA is mentioned for analysis, but this is not a full dataset split sufficient for reproduction. |
| Hardware Specification | Yes | Training was done on 8 A100-SXM4-80GB GPUs for 1 epoch with the default configuration and hyperparameters. This setup is detailed in the Appendix. |
| Software Dependencies | No | The paper states 'We deploy our method on the advanced LVLM MiniCPM-Llama3-V' but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used for implementation beyond the base model. |
| Experiment Setup | Yes | Experimental Setup: We use the respective prompt set by each LVLM to get its best performance and set the temperature to the default value in the model implementation. OCR tokens were provided if the model required them by default. [...] Training was done on 8 A100-SXM4-80GB GPUs for 1 epoch with the default configuration and hyperparameters. This setup is detailed in the Appendix. |
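
The core training signal quoted above (monolingual output logits acting as a teacher for the cross-lingual student via KL divergence minimization) can be sketched with a minimal, self-contained example. This is not the authors' released implementation; the function name, the temperature parameter, and the NumPy setting are illustrative assumptions, and a real LVLM would apply this loss per generated token over the vocabulary dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def crosslingual_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over tokens.

    teacher_logits: monolingual-setting output logits (teacher, no gradient).
    student_logits: cross-lingual-setting output logits (student).
    Shapes: (..., vocab_size). Returns a scalar loss.
    """
    p = softmax(teacher_logits / temperature)          # teacher distribution
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits / temperature) + 1e-12)
    kl = (p * (log_p - log_q)).sum(axis=-1)            # per-token KL divergence
    # Standard distillation scaling so gradients are comparable across temperatures.
    return float(kl.mean()) * temperature ** 2
```

Minimizing this loss pushes the cross-lingual output distribution toward the monolingual one; the loss is zero exactly when the two distributions match, and otherwise positive.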