Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Generalization vs Specialization under Concept Shift
Authors: Alex Nguyen, David J Schwab, Vudtiwat Ngampruetikorn
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical results are in good agreement with experiments based on transformers pretrained to solve linear regression; under concept shift, too long context length can be detrimental to generalization performance of next token prediction. Finally, experiments on MNIST and Fashion MNIST further validate our theoretical predictions, suggesting these phenomena represent a fundamental aspect of learning under distribution shift. |
| Researcher Affiliation | Academia | Alex Nguyen (Princeton University); David J. Schwab* (CUNY Graduate Center); Vudtiwat Ngampruetikorn* (University of Sydney) |
| Pseudocode | No | The paper describes methods and derivations mathematically and textually, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: While we do not have open-source code for now, we will open source our code once the anonymity period for submission is over. |
| Open Datasets | Yes | Finally, experiments on MNIST and Fashion MNIST further validate our theoretical predictions, suggesting these phenomena represent a fundamental aspect of learning under distribution shift. We consider standard multinomial logistic regression for MNIST [40] and Fashion MNIST [41], using Adam optimizer with a minibatch size of 500 and a learning rate of 0.001 for 2,000 epochs. |
| Dataset Splits | Yes | In our experiment, a transformer takes as its input a series of points $(x_i, y_i)$ on an unknown function $y_i = f(x_i)$ for $i = 1, 2, \ldots, n-1$, terminating with a query $x_n$ whose function value $y_n$ is the prediction target. ... At test time, we sample 10,000 new tasks and compute the in-distribution prediction risk simply as the MSE of the transformer on test tasks. ... To vary training sample size $N$, we choose training data points at random (without replacement); all of the training data is used when $N = 60{,}000$. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory amounts, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The model is trained to minimize the next token mean squared error (MSE) using Adam with a learning rate of 0.0001. ... We consider standard multinomial logistic regression for MNIST [40] and Fashion MNIST [41], using Adam optimizer with a minibatch size of 500 and a learning rate of 0.001 for 2,000 epochs. The paper mentions the 'Adam' optimizer but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | We consider linear regression in $d = 32$ dimensions. ... We choose $\sigma^2 = 0.5$. ... We adopt the nano GPT architecture [39] with eight layers, an embedding dimension of 128, learnable position embeddings, and causal masking. The model is trained to minimize the next token mean squared error (MSE) using Adam with a learning rate of 0.0001. ... We consider standard multinomial logistic regression for MNIST [40] and Fashion MNIST [41], using Adam optimizer with a minibatch size of 500 and a learning rate of 0.001 for 2,000 epochs. |
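
As an illustration of the data setup quoted in the Dataset Splits and Experiment Setup rows, the following Python sketch draws one in-context linear regression task and picks a random training subset without replacement. It is not the authors' code (none has been released). The standard Gaussian inputs and task weights, the reading of 0.5 as the label-noise variance, and the names `sample_task` and `subsample_indices` are all assumptions made for illustration.

```python
import random


def sample_task(n_points, dim=32, noise_var=0.5, rng=None):
    """Draw one linear regression task as n_points pairs (x_i, y_i).

    Assumptions (not stated in the quoted text): inputs and task weights
    are standard Gaussian, and noise_var is the variance of additive
    Gaussian label noise.
    """
    rng = rng or random.Random()
    weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    noise_std = noise_var ** 0.5
    xs, ys = [], []
    for _ in range(n_points):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # y_i = f(x_i): a linear function of x_i plus label noise.
        y = sum(w * xj for w, xj in zip(weights, x)) + rng.gauss(0.0, noise_std)
        xs.append(x)
        ys.append(y)
    return xs, ys


def subsample_indices(total, size, rng=None):
    """Choose `size` distinct training indices out of `total` (no replacement)."""
    rng = rng or random.Random()
    return rng.sample(range(total), size)
```

In this sketch the last pair plays the role of the query: a model would see `xs[:-1]`, `ys[:-1]`, and `xs[-1]`, and would be scored by the squared error against `ys[-1]`. Calling `subsample_indices(60000, 60000)` returns every index, matching the case where all of the training data is used.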