Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Generating Computational Cognitive Models using Large Language Models

Authors: Milena Rmus, Akshay Kumar Jagadish, Marvin Mathony, Tobias Ludwig, Eric Schulz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Building on this potential, we introduce a pipeline for Guided generation of Computational Cognitive Models (GeCCo). Given task instructions, participant data, and a template function, GeCCo prompts an LLM to propose candidate models, fits proposals to held-out data, and iteratively refines them based on feedback constructed from their predictive performance. We benchmark this approach across four different cognitive domains (decision making, learning, planning, and memory) using three open-source LLMs, spanning different model sizes, capacities, and families. On four human behavioral data sets, the LLM-generated models consistently matched or outperformed the best domain-specific models from the cognitive science literature.
Researcher Affiliation	Academia	Milena Rmus (Helmholtz Munich); Akshay K. Jagadish (Princeton University); Marvin Mathony (Helmholtz Munich); Tobias Ludwig (Tübingen University); Eric Schulz (Helmholtz Munich)
Pseudocode	Yes	Figure 1: Schematic of GeCCo: We prompt the LLM with a task description, participant data, guardrails to constrain the format of LLM responses, and the code template to generate cognitive models that offer different explanations of the underlying data as Python functions.
Open Source Code	Yes	The code for GeCCo is available at https://github.com/MilenaCCNlab/gecco.git
Open Datasets	Yes	On four human behavioral data sets, the LLM-generated models consistently matched or outperformed the best domain-specific models from the cognitive science literature.
Dataset Splits	Yes	On each iteration, the base LLM was prompted using a fixed prompt structure (see Appendix 6) that included 1) a natural language task description, 2) behavioral data from a subset of participants (prompt data)... Each model was then fit to a second, held-out validation dataset (not shown in the prompt)... After completing all iterations in a run, the best LLM-generated model was compared against competing cognitive models by evaluating on a third, held-out test set.
Hardware Specification	Yes	In practice, our approach takes a maximum of 8 hours per task domain on four Nvidia A100s with 40GB memory each.
Software Dependencies	No	Each model was then fit to a second, held-out validation dataset (not shown in the prompt) using the minimize function from the SciPy optimization library (Virtanen et al., 2020) [...]. The paper mentions SciPy but no specific version number for SciPy or Python.
Experiment Setup	Yes	We set the temperature to 0.2 for Llama models, 0.15 for Qwen and 0.1 for R1 to encourage some exploration when generating models. In practice, our approach takes a maximum of 8 hours per task domain on four Nvidia A100s with 40GB memory each. [...] initialized from 20 random starting points to avoid local minima.
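
The three-way split quoted in the Dataset Splits row (prompt data, a held-out validation set, and a held-out test set) can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_participants`, the participant-level split, the set sizes, and the seed are all assumptions made for illustration, since the quoted excerpts do not state them.

```python
import random


def split_participants(participant_ids, n_prompt, n_validation, seed=0):
    """Partition participant IDs into disjoint prompt, validation, and test sets.

    Hypothetical helper: the sizes and the participant-level split are
    assumptions, not values reported in the table above.
    """
    ids = list(participant_ids)
    if n_prompt + n_validation >= len(ids):
        raise ValueError("need at least one participant left over for the test set")

    # Shuffle with a fixed seed so the same split can be regenerated later.
    rng = random.Random(seed)
    rng.shuffle(ids)

    prompt = ids[:n_prompt]                             # shown to the LLM in the prompt
    validation = ids[n_prompt:n_prompt + n_validation]  # used to fit and score proposals
    test = ids[n_prompt + n_validation:]                # used only for the final comparison
    return prompt, validation, test


# Example with 100 hypothetical participants.
prompt_ids, validation_ids, test_ids = split_participants(
    range(100), n_prompt=10, n_validation=30
)
```

Keeping the three sets disjoint is what lets the final comparison against competing cognitive models count as out-of-sample: the test participants influence neither the prompt nor the model selection across iterations.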