Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Evaluating the Robustness of Analogical Reasoning in Large Language Models

Authors: Martha Lewis, Melanie Mitchell

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For each of these domains we test humans and GPT models on robustness to variants of the original analogy problems: versions that test the same abstract reasoning abilities but that are likely dissimilar from tasks in the pre-training data.
Researcher Affiliation | Academia | Martha Lewis (EMAIL), ILLC, University of Amsterdam, Amsterdam, The Netherlands; Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA. Melanie Mitchell (EMAIL), Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA.
Pseudocode | No | The paper describes its methods and experimental procedures in natural language, without presenting any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, data, and results for all experiments are available at https://github.com/marthaflinderslewis/robust-analogy.
Open Datasets | Yes | Code, data, and results for all experiments are available at https://github.com/marthaflinderslewis/robust-analogy. WHL tested humans, GPT-3, and GPT-4 on a set of 18 story-analogy problems from Gentner et al. (1993).
Dataset Splits | No | The paper evaluates pre-trained language models in a zero-shot setting and describes problem sampling for human participants, but it does not specify explicit training/validation/test splits, since no model training is performed.
Hardware Specification | No | The paper evaluates pre-trained GPT models (GPT-3, GPT-3.5, GPT-4) and does not specify the hardware the authors used to conduct their experiments.
Software Dependencies | No | The paper names specific versions of the evaluated GPT models (e.g., 'GPT-3 (text-davinci-003)', 'GPT-3.5 (gpt-3.5-turbo-0613)', 'GPT-4 (gpt-4-0613)') but does not list any software dependencies or libraries with version numbers for the authors' experimental setup.
Experiment Setup | Yes | Following WHL, all GPT experiments were done with temperature set to zero. GPT-3 takes in a single prompt, whereas GPT-3.5 and GPT-4 take in a list of messages that define the role of the system, input from a user role, and, optionally, some dialogue with simulated responses from the model given under the assistant role. For our experiments on letter-string analogy variants, we tested three different prompt formats, including one similar to the instructions given in our human study and one in a zero-shot chain-of-thought setup.
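The two request formats described in the Experiment Setup row (a single completion prompt for GPT-3 versus role-tagged chat messages for GPT-3.5/GPT-4) could be sketched as follows. The model identifiers and temperature setting come from the paper; the helper functions, the system message, and the example problem text are hypothetical illustrations, not the authors' actual prompts.

```python
# Sketch of the two request formats: GPT-3 (text-davinci-003) takes a
# single prompt string, while GPT-3.5/GPT-4 take a list of messages
# with "system", "user", and optionally "assistant" roles.

def build_completion_request(problem: str) -> dict:
    """Single-prompt completion format used for GPT-3."""
    return {
        "model": "text-davinci-003",
        "prompt": problem,
        "temperature": 0,  # all GPT experiments used temperature zero
    }

def build_chat_request(problem: str, model: str = "gpt-4-0613") -> dict:
    """Chat format used for GPT-3.5/GPT-4 (system + user messages)."""
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            # Hypothetical system instruction, for illustration only.
            {"role": "system", "content": "Solve the letter-string analogy problem."},
            {"role": "user", "content": problem},
        ],
    }

# Hypothetical letter-string analogy problem in the style of the paper.
req = build_chat_request("If a b c changes to a b d, what does p q r change to?")
```

Either dictionary could then be passed to the corresponding OpenAI completion or chat endpoint; the point of the sketch is only the structural difference between the two formats.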