Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Evaluating the Robustness of Analogical Reasoning in Large Language Models

Authors: Martha Lewis, Melanie Mitchell

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For each of these domains we test humans and GPT models on robustness to variants of the original analogy problems: versions that test the same abstract reasoning abilities but that are likely dissimilar from tasks in the pre-training data.
Researcher Affiliation | Academia | Martha Lewis (EMAIL), ILLC, University of Amsterdam, Amsterdam, The Netherlands; Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA. Melanie Mitchell (EMAIL), Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA.
Pseudocode | No | The paper describes its methods and experimental procedures in natural language, without presenting any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, data, and results for all experiments are available at https://github.com/marthaflinderslewis/robust-analogy.
Open Datasets | Yes | Code, data, and results for all experiments are available at https://github.com/marthaflinderslewis/robust-analogy. WHL tested humans, GPT-3, and GPT-4 on a set of 18 story-analogy problems from Gentner et al. (1993).
Dataset Splits | No | The paper evaluates pre-trained language models in a zero-shot setting and describes problem sampling for human participants, but it does not specify explicit training/validation/test splits, since no model training is performed.
Hardware Specification | No | The paper evaluates pre-trained GPT models (GPT-3, GPT-3.5, GPT-4) and does not specify the hardware the authors used to conduct their experiments.
Software Dependencies | No | The paper names specific versions of the evaluated GPT models (e.g., 'GPT-3 (text-davinci-003)', 'GPT-3.5 (gpt-3.5-turbo-0613)', 'GPT-4 (gpt-4-0613)') but does not list any software dependencies or libraries with version numbers for the authors' experimental setup.
Experiment Setup | Yes | Following WHL, all GPT experiments were done with temperature set to zero. GPT-3 takes in a single prompt, whereas GPT-3.5 and GPT-4 take in a list of messages that define the role of the system, input from a user role, and, optionally, some dialogue with simulated responses from the model given under the assistant role. For our experiments on letter-string analogy variants, we tested three different prompt formats, including one similar to the instructions given in our human study and one in a zero-shot chain-of-thought setup.
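The two request formats described in the Experiment Setup row (a single completion prompt for GPT-3 versus role-tagged chat messages for GPT-3.5/GPT-4) could be sketched as follows. The model identifiers and temperature setting come from the paper; the helper functions, the system message, and the example problem text are hypothetical illustrations, not the authors' actual prompts.

```python
# Sketch of the two request formats: GPT-3 (text-davinci-003) takes a
# single prompt string, while GPT-3.5/GPT-4 take a list of messages
# with "system", "user", and optionally "assistant" roles.

def build_completion_request(problem: str) -> dict:
    """Single-prompt completion format used for GPT-3."""
    return {
        "model": "text-davinci-003",
        "prompt": problem,
        "temperature": 0,  # all GPT experiments used temperature zero
    }

def build_chat_request(problem: str, model: str = "gpt-4-0613") -> dict:
    """Chat format used for GPT-3.5/GPT-4 (system + user messages)."""
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            # Hypothetical system instruction, for illustration only.
            {"role": "system", "content": "Solve the letter-string analogy problem."},
            {"role": "user", "content": problem},
        ],
    }

# Hypothetical letter-string analogy problem in the style of the paper.
req = build_chat_request("If a b c changes to a b d, what does p q r change to?")
```

Either dictionary could then be passed to the corresponding OpenAI completion or chat endpoint; the point of the sketch is only the structural difference between the two formats.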