Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Large Language Models as Model Organisms for Human Associative Learning

Authors: Camila Kolling, Vy Vo, Mariya Toneva

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Building on LLMs' in-context learning, we adapt a cognitive neuroscience associative learning paradigm and investigate how representations evolve across six models. Our initial findings reveal a non-monotonic pattern consistent with the Non-Monotonic Plasticity Hypothesis, with moderately similar items differentiating after learning. Leveraging the controllability of LLMs, we further show that this differentiation is modulated by the overlap of associated items with the broader vocabulary, a factor we term vocabulary interference, capturing how new associations compete with prior knowledge.
Researcher Affiliation	Academia	Camila Kolling, Vy Ai Vo, Mariya Toneva; Max Planck Institute for Software Systems, Saarbrücken, Germany
Pseudocode	No	The paper describes methods like the Greedy Coordinate Gradient (GCG) algorithm in prose (Section 3.2), but does not present any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code available at github.com/bridge-ai-neuro/llm-associative-learning. We have attached with the submission the code necessary to reproduce our main results and upon acceptance we will publicly release it.
Open Datasets	No	The paper describes generating 'token pairs' and sampling from LLM vocabularies: 'To systematically find tokens whose pair similarity before learning falls within a given interval, we employ an efficient way for searching the large vocabulary space...' and 'We randomly sample 1,000 tokens from Vm to form the representative subset Vm'. It neither refers to a pre-existing public dataset nor provides concrete access information for the generated token pairs as a public dataset.
Dataset Splits	No	The paper discusses how token pairs are selected and grouped into 'similarity groups' and 'vocabulary interference groups' for experimental purposes, but it does not describe training/test/validation splits for a dataset in the context of model training, as the LLMs used are pre-trained. The LLMs themselves are the subjects of the experiments, and the 'data' (token pairs) are generated for testing.
Hardware Specification	Yes	All experiments were performed on internal compute clusters, using two NVIDIA H100 PCIe GPUs with 80GB GPU memory per device.
Software Dependencies	No	The paper lists the LLM models used (Llama2-7b, Llama3.1-8b, Llama3.2-1b, Llama3.2-3b, Gemma2-9b, and Mistral-7b), but it does not specify software dependencies such as programming language versions, specific libraries (e.g., PyTorch, TensorFlow), or their corresponding version numbers.
Experiment Setup	Yes	Our associative learning paradigm is inspired by the experimental design of [46]... Formally, we present the token pair (x, y) a total of r - 1 times, followed by one final presentation of x alone as a cue for predicting its paired token y. Given the input sequence s = [x_1, y_1, x_2, y_2, ..., x_{r-1}, y_{r-1}, x_r]... We extract the hidden representations of a token x at the last layer of the model... We sample evenly along the cosine similarity axis, defining 17 groups g that fall within the interval [0.1, 0.95)... We analyze six recent open-source base LLMs: Llama2-7b, Llama3.1-8b, Llama3.2-1b, Llama3.2-3b, Gemma2-9b, and Mistral-7b.
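
The setup quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, plain lists of floats stand in for last-layer hidden representations, and the assumption that the 17 groups are equal-width bins (0.05 each) over [0.1, 0.95) is one reading of "sample evenly along the cosine similarity axis".

```python
import math
from typing import List, Optional, Sequence


def build_sequence(x: str, y: str, r: int) -> List[str]:
    """Return the pair (x, y) repeated r - 1 times, then a final cue x.

    Mirrors s = [x_1, y_1, ..., x_{r-1}, y_{r-1}, x_r] from the quoted setup.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    return [x, y] * (r - 1) + [x]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_group(
    sim: float, low: float = 0.1, high: float = 0.95, n_groups: int = 17
) -> Optional[int]:
    """Map a similarity to a group index in 0..n_groups-1, or None if outside.

    Assumption: groups are equal-width bins over [low, high).
    """
    if not (low <= sim < high):
        return None
    width = (high - low) / n_groups
    # Clamp to guard against floating-point rounding at the upper edge.
    return min(int((sim - low) / width), n_groups - 1)


if __name__ == "__main__":
    print(build_sequence("x", "y", r=3))
    print(similarity_group(cosine_similarity([1.0, 0.2], [0.9, 0.6])))
```

In an actual replication, the vectors passed to cosine_similarity would be the last-layer hidden representations of the two tokens, taken from the model before and after the repeated presentations.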