Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	With MATHVISTA, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models.
Researcher Affiliation	Collaboration	1UCLA, 2University of Washington, 3Microsoft Research, Redmond
Pseudocode	Yes	Figure 6: Two examples from GPT-4. GPT-4 depends on the qualities of the generated caption and detected OCR texts. In (b), some information is incorrect, even though the final answer is correct. (a) Correct answer and code
Open Source Code	No	The paper provides a project website (https://mathvista.github.io) but does not contain an explicit, unambiguous statement that the source code for the methodology is openly available or a direct link to a code repository for their work.
Open Datasets	Yes	We collected nine Math QA datasets in multimodal settings, including four for GPS, two for MWP with visual contexts of synthetic scenes, abstract diagrams, and tables, and two for TQA on college curricula (see C.4) ... We reviewed more than 70 datasets, collecting 19 of them that contain math-related instances and are publicly available, as listed in C.4.
Dataset Splits	Yes	MATHVISTA consists of 6,141 examples, divided into two subsets: testmini and test. testmini contains 1,000 examples, intended for model development validation or for those with limited computing resources.
Hardware Specification	No	The paper mentions specific models like GPT-4V and Bard, which are commercial products, but it does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for their experiments.
Software Dependencies	No	The paper mentions software components and models like 'EasyOCR (JaidedAI, 2020)' and 'ChatGPT (OpenAI, 2022)' but does not provide specific version numbers for these software dependencies or libraries.
Experiment Setup	Yes	We provide the prompts for LLMs and the hyperparameters used for LMMs in F.