Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Generating Computational Cognitive Models using Large Language Models
Authors: Milena Rmus, Akshay Kumar Jagadish, Marvin Mathony, Tobias Ludwig, Eric Schulz
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Building on this potential, we introduce a pipeline for Guided generation of Computational Cognitive Models (GeCCo). Given task instructions, participant data, and a template function, GeCCo prompts an LLM to propose candidate models, fits proposals to held-out data, and iteratively refines them based on feedback constructed from their predictive performance. We benchmark this approach across four different cognitive domains (decision making, learning, planning, and memory) using three open-source LLMs, spanning different model sizes, capacities, and families. On four human behavioral data sets, the LLM-generated models consistently matched or outperformed the best domain-specific models from the cognitive science literature. |
| Researcher Affiliation | Academia | Milena Rmus Helmholtz Munich EMAIL Akshay K. Jagadish Princeton University EMAIL Marvin Mathony Helmholtz Munich EMAIL Tobias Ludwig Tübingen University EMAIL Eric Schulz Helmholtz Munich EMAIL |
| Pseudocode | Yes | Figure 1: Schematic of GeCCo: We prompt the LLM with a task description, participant data, guardrails to constrain the format of LLM responses, and the code template to generate cognitive models that offer different explanations of the underlying data as Python functions. |
| Open Source Code | Yes | The code for GeCCo is available at https://github.com/MilenaCCNlab/gecco.git |
| Open Datasets | Yes | On four human behavioral data sets, the LLM generated models consistently matched or outperformed the best domain-specific models from the cognitive science literature. |
| Dataset Splits | Yes | On each iteration, the base LLM was prompted using a fixed prompt structure (see Appendix 6) that included 1) a natural language task description, 2) behavioral data from a subset of participants (prompt data)... Each model was then fit to a second, held-out validation dataset (not shown in the prompt)... After completing all iterations in a run, the best LLM-generated model was compared against competing cognitive models by evaluating on a third, held-out test set. |
| Hardware Specification | Yes | In practice, our approach takes a maximum of 8 hours per task domain on four Nvidia A100s with 40GB memory each. |
| Software Dependencies | No | Each model was then fit to a second, held-out validation dataset (not shown in the prompt) using the minimize function from the SciPy optimization library (Virtanen et al., 2020) [...]. The paper mentions SciPy but no specific version number for SciPy or Python. |
| Experiment Setup | Yes | We set the temperature to 0.2 for Llama models, 0.15 for Qwen and 0.1 for R1 to encourage some exploration when generating models. In practice, our approach takes a maximum of 8 hours per task domain on four Nvidia A100s with 40GB memory each. [...] initialized from 20 random starting points to avoid local minima. |
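
The fitting procedure quoted above (SciPy's `minimize`, initialized from 20 random starting points to avoid local minima) can be illustrated with a minimal sketch. Only the use of `minimize` and the 20 restarts come from the quoted text. The toy two-armed bandit model, the parameter bounds, the `L-BFGS-B` method, the simulated data, and the function names `negative_log_likelihood` and `fit_multistart` are assumptions made for illustration and are not taken from the paper or its repository.

```python
import numpy as np
from scipy.optimize import minimize


def negative_log_likelihood(params, choices, rewards):
    """NLL of a toy two-armed bandit learner (hypothetical stand-in model)."""
    learning_rate, inverse_temperature = params
    values = np.zeros(2)  # running value estimate for each arm
    nll = 0.0
    for choice, reward in zip(choices, rewards):
        logits = inverse_temperature * values
        # Log of the softmax choice probabilities, computed stably.
        log_probs = logits - np.logaddexp(logits[0], logits[1])
        nll -= log_probs[choice]
        # Delta-rule update of the chosen arm's value.
        values[choice] += learning_rate * (reward - values[choice])
    return nll


def fit_multistart(objective, bounds, args, n_starts=20, seed=0):
    """Run a bounded local optimizer from several random starts; keep the best."""
    rng = np.random.default_rng(seed)
    lower = np.array([lo for lo, _ in bounds])
    upper = np.array([hi for _, hi in bounds])
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lower, upper)  # random start inside the bounds
        result = minimize(objective, x0, args=args,
                          method="L-BFGS-B", bounds=bounds)
        if best is None or result.fun < best.fun:
            best = result
    return best


# Simulated stand-in data; real use would pass the validation participants' data.
data_rng = np.random.default_rng(1)
choices = data_rng.integers(0, 2, size=200)
rewards = data_rng.integers(0, 2, size=200).astype(float)

# Assumed bounds: learning rate in [0, 1], inverse temperature in [0, 10].
bounds = [(0.0, 1.0), (0.0, 10.0)]
best = fit_multistart(negative_log_likelihood, bounds, (choices, rewards))
print("best parameters:", best.x, "NLL:", best.fun)
```

Keeping the lowest objective value across restarts is one common way to reduce sensitivity to the starting point. The paper does not report the optimizer method or the SciPy version, so results from a sketch like this would not be expected to match the published fits exactly.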