Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

In-Context Learning Strategies Emerge Rationally

Authors: Daniel Wurgaft, Ekdeep S Lubana, Core Francisco Park, Hidenori Tanaka, Gautam Reddy, Noah D. Goodman

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across three distinct settings, we replicate well-known ICL phenomena [24, 25] and show models primarily transition between two ICL phases, determined by behaviorally matching one of two Bayesian predictors: 1) a memorizing predictor with a discrete prior over seen tasks or 2) a generalizing predictor with a continuous prior over the true data-generating distribution. ... We derive a closed-form expression for our model, which almost perfectly predicts Transformer next-token predictions throughout training without access to weights, as well as captures varied ICL phenomena including task diversity effects and transience.
Researcher Affiliation	Collaboration	Daniel Wurgaft (1,2,3), Ekdeep Singh Lubana (2,3), Core Francisco Park (2), Hidenori Tanaka (2,3), Gautam Reddy (4), Noah D. Goodman (1,5). (1) Department of Psychology, Stanford University; (2) CBS-NTT Program in Physics of Intelligence, Harvard University; (3) Physics of Artificial Intelligence Group, NTT Research, Inc., Sunnyvale, CA, USA; (4) Joseph Henry Laboratories of Physics, Princeton University; (5) Department of Computer Science, Stanford University. Equal contribution.
Pseudocode	No	The paper describes methods and derivations mathematically and narratively (e.g., Section 2 "Preliminaries: Learning a Finite Mixture of Tasks", Section 3 "What Strategies: Memorizing and Generalizing Predictors", Section 4 "Answering the Why: A Hierarchical Bayesian Account of ICL", and Appendix D "Derivations"), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	All code used to run the experiments and analysis, as well as evaluation metrics for all settings, is available at: https://github.com/DanielWurgaft/rational-icl
Open Datasets	No	We analyze three distinct instantiations of this general formulation: Balls & Urns, which captures the belief update interpretation of ICL and is a simplification of the Markov modeling setting from prior work [26, 52], and two popularly studied settings from the literature that capture the few-shot learning interpretation of ICL, i.e., in-context linear regression [14, 25] and Classification [20, 24, 28, 29]. These are problem settings or synthetic data generation processes, not pre-existing publicly available datasets.
Dataset Splits	Yes	For OOD evaluation of both the Transformer and our procedurally defined predictors, i.e., the memorizing predictor M and generalizing predictor G, we draw 500 sequences from 500 unseen tasks (however, following still the same task distribution T_true). In comparison, ID evaluation involves 500 sequences from seen tasks. If task-diversity D is less than 500, sequences from the same task may be seen multiple times. ... To fit the 3 free parameters of the Bayesian model, we minimize the mean KL divergence (or mean-squared error in the linear-regression setting) between the interpolated predictions and the Transformer outputs. Optimization is performed with scipy.optimize.minimize using the L-BFGS-B algorithm, capped at 1K iterations and 2K function evaluations, with gradient and function tolerances of 10^-7. Exact gradients are supplied via PyTorch's automatic differentiation, ensuring stable convergence. For each task we fit on 80% of the (N, D) configuration grid and reserve the remaining 20% for held-out validation and diagnostic checks.
Hardware Specification	Yes	All models are trained on A100 GPUs, with maximum training budget reaching 2 days for all experiments encompassing the linear regression setting.
Software Dependencies	No	For all settings, we use the GPT-NeoX architecture sourced from Huggingface [56, 57]. ... Optimization is performed with scipy.optimize.minimize using the L-BFGS-B algorithm... Exact gradients are supplied via PyTorch's automatic differentiation, ensuring stable convergence. The paper mentions software like Huggingface, PyTorch, and scipy, but does not provide specific version numbers for these dependencies, which are necessary for reproducible descriptions.
Experiment Setup	Yes	For all settings, we use the GPT-NeoX architecture sourced from Huggingface [56, 57]. While the number of layers / blocks in the model depend on the specific experimental setting (as reported below), we use only 1 attention head per layer and follow a sequential residual stream architecture across all settings. Training. We use the Huggingface trainer with default parameters, changing only the learning rate, batch-size, total iterations, and warmup steps (reported below). Gradients are clipped to unit norm. All models are trained on A100 GPUs, with maximum training budget reaching 2 days for all experiments encompassing the linear regression setting. We vary data-diversity D from {2^2, 2^4, ..., 2^12} across all settings. Settings-Specific Details. ... Balls and Urns. Models of hidden dimension size 64 are trained for 100K steps, with no warmup steps, at a constant learning rate of 5 × 10^-4 and batch-size of 64. For our analysis, we derive experimental settings from combinations of task-dimensionality (equivalent to vocabulary-size), which varies in the set {8, 12, 16}; context length, which varies in the set {128, 256, 320}; and MLP expansion factor, which varies in the set {0.5, 4, 8}.
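
The Balls and Urns configuration space quoted in the Experiment Setup row can be enumerated with a short sketch. It is illustrative only and is not taken from the authors' repository: the dictionary keys and function name are hypothetical, the task-diversity values are read as the even powers of two from 2^2 to 2^12, and a full Cartesian product of the listed sets is assumed, which the quoted text does not confirm.

```python
from itertools import product

# Values quoted in the Experiment Setup row for the Balls and Urns setting.
TASK_DIMS = (8, 12, 16)            # task dimensionality (equals vocabulary size)
CONTEXT_LENGTHS = (128, 256, 320)
MLP_EXPANSIONS = (0.5, 4, 8)

# Task diversity D, read here as even powers of two from 2^2 to 2^12.
DIVERSITIES = tuple(2 ** k for k in range(2, 13, 2))

# Training hyperparameters the row reports as fixed for Balls and Urns.
# The dictionary keys are hypothetical names chosen for this sketch.
FIXED = {
    "hidden_size": 64,
    "train_steps": 100_000,
    "warmup_steps": 0,
    "learning_rate": 5e-4,
    "batch_size": 64,
    "attention_heads_per_layer": 1,
}


def balls_and_urns_grid():
    """Return one config dict per (task_dim, context_length, mlp_expansion, D).

    Assumes a full Cartesian product, which the source does not confirm.
    """
    return [
        {**FIXED, "task_dim": dim, "context_length": ctx,
         "mlp_expansion": mlp, "task_diversity": d}
        for dim, ctx, mlp, d in product(
            TASK_DIMS, CONTEXT_LENGTHS, MLP_EXPANSIONS, DIVERSITIES)
    ]


grid = balls_and_urns_grid()
print(f"{len(grid)} configurations; D values: {DIVERSITIES}")
```

Under these assumptions the grid covers 27 architecture and context settings at each of six task-diversity levels. The actual set of runs may be smaller if the authors did not train every combination.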