Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generalization vs Specialization under Concept Shift

Authors: Alex Nguyen, David J Schwab, Vudtiwat Ngampruetikorn

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our theoretical results are in good agreement with experiments based on transformers pretrained to solve linear regression; under concept shift, too long context length can be detrimental to generalization performance of next token prediction. Finally, experiments on MNIST and Fashion MNIST further validate our theoretical predictions, suggesting these phenomena represent a fundamental aspect of learning under distribution shift.
Researcher Affiliation	Academia	Alex Nguyen (Princeton University); David J. Schwab* (CUNY Graduate Center); Vudtiwat Ngampruetikorn* (University of Sydney)
Pseudocode	No	The paper describes methods and derivations mathematically and textually, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: While we do not have open-source code for now, we will open source our code once the anonymity period for submission is over.
Open Datasets	Yes	Finally, experiments on MNIST and Fashion MNIST further validate our theoretical predictions, suggesting these phenomena represent a fundamental aspect of learning under distribution shift. We consider standard multinomial logistic regression for MNIST [40] and Fashion MNIST [41], using Adam optimizer with a minibatch size of 500 and a learning rate of 0.001 for 2,000 epochs.
Dataset Splits	Yes	In our experiment, a transformer takes as its input a series of points (𝑥_𝑖, 𝑦_𝑖) on an unknown function 𝑦_𝑖 = 𝑓(𝑥_𝑖) for 𝑖 = 1, 2, ..., 𝑛 − 1, terminating with a query 𝑥_𝑛 whose function value 𝑦_𝑛 is the prediction target. ... At test time, we sample 10,000 new tasks and compute the in-distribution prediction risk simply as the MSE of the transformer on test tasks. ... To vary training sample size 𝑁, we choose training data points at random (without replacement); all of the training data is used when 𝑁 = 60,000.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU/CPU models, memory amounts, or detailed computer specifications used for running the experiments.
Software Dependencies	No	The model is trained to minimize the next token mean squared error (MSE) using Adam with a learning rate of 0.0001. ... We consider standard multinomial logistic regression for MNIST [40] and Fashion MNIST [41], using Adam optimizer with a minibatch size of 500 and a learning rate of 0.001 for 2,000 epochs. The paper mentions the 'Adam' optimizer but does not specify its version or any other software dependencies with version numbers.
Experiment Setup	Yes	We consider linear regression in 𝑃 = 32 dimensions. ... We choose 𝜎² = 0.5. ... We adopt the nanoGPT architecture [39] with eight layers, an embedding dimension of 128, learnable position embeddings, and causal masking. The model is trained to minimize the next token mean squared error (MSE) using Adam with a learning rate of 0.0001. ... We consider standard multinomial logistic regression for MNIST [40] and Fashion MNIST [41], using Adam optimizer with a minibatch size of 500 and a learning rate of 0.001 for 2,000 epochs.
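
The settings quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch. The Python below is illustrative only and is not the authors' code, which the paper states is not yet released. The class and function names (`TransformerSetup`, `LogisticRegressionSetup`, `sample_task`, `subsample_indices`) are hypothetical. The standard-normal inputs and the 1/sqrt(P) weight scaling are assumptions, since the table does not state the sampling distributions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerSetup:
    """Values quoted in the table for the in-context linear regression experiment."""
    input_dim: int = 32            # P = 32
    noise_variance: float = 0.5    # sigma^2 = 0.5
    num_layers: int = 8            # nanoGPT-style model with eight layers
    embedding_dim: int = 128
    learning_rate: float = 1e-4    # Adam
    num_test_tasks: int = 10_000   # new tasks sampled at test time


@dataclass(frozen=True)
class LogisticRegressionSetup:
    """Values quoted in the table for the MNIST / Fashion MNIST experiments."""
    batch_size: int = 500
    learning_rate: float = 1e-3    # Adam
    epochs: int = 2_000
    max_train_samples: int = 60_000  # N = 60,000 uses all training data


def sample_task(setup, context_length, rng):
    """Sample one noisy linear-regression task as lists of inputs and targets.

    ASSUMPTION: inputs are standard normal and weights are standard normal
    scaled by 1/sqrt(P); the table does not specify these distributions.
    """
    p = setup.input_dim
    weights = [rng.gauss(0.0, 1.0) / p ** 0.5 for _ in range(p)]
    noise_std = setup.noise_variance ** 0.5
    xs, ys = [], []
    for _ in range(context_length):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        y = sum(w * v for w, v in zip(weights, x)) + rng.gauss(0.0, noise_std)
        xs.append(x)
        ys.append(y)
    return xs, ys


def subsample_indices(total, n, rng):
    """Choose n training indices at random without replacement."""
    return rng.sample(range(total), n)


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = sample_task(TransformerSetup(), context_length=5, rng=rng)
    subset = subsample_indices(LogisticRegressionSetup().max_train_samples, 1_000, rng)
    print(len(xs), len(xs[0]), len(subset))
```

In a sequence of `context_length` pairs, the final pair plays the role of the query 𝑥_𝑛 and its target 𝑦_𝑛, with the preceding pairs serving as context.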