Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Prior Forgetting and In-Context Overfitting
Authors: Sungyoon Lee
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To provide an analytical understanding of the learning dynamics of the ICL abilities, we investigate the in-context random linear regression problem with a simple linear-attention-based transformer, and define and disentangle the strengths of the task recognition and task learning abilities stored in the transformer model's parameters. We show that, during the pretraining phase, the model first learns the task learning and the task recognition abilities together in the beginning, but it (a) gradually forgets the task recognition ability to recall the priorly learned tasks and (b) relies more on the given context in the later phase, which we call (a) prior forgetting and (b) in-context overfitting, respectively. (...) Figure 2: Evolution of the two parameters, α and κ, for different b's. We train the two-parameter transformer using SGD with learning rate of 0.01 and batch size of 4,000. We also use n = 10, d = 5, and σ = 0.2, 0.4, 0.8, i.e., the task dispersion b = σ²d = 0.2, 0.8, 3.2 (from Left to Right). Top: Empirical results with SGD (solid lines) and theoretical results of (11) and (12) with gradient flow (dashed lines). |
| Researcher Affiliation | Academia | Sungyoon Lee Department of Computer Science Hanyang University EMAIL |
| Pseudocode | No | The paper describes mathematical derivations and theoretical models but does not present any structured pseudocode or algorithm blocks. It provides equations for the transformer model and training dynamics. |
| Open Source Code | Yes | Answer: [Yes] Justification: We modify the code from https://github.com/chengxiang/ Linear Transformer. See the supplemental material. |
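| Open Datasets | No | We train a transformer with the training set, which consists of the input context matrices and the corresponding target responses. The input context matrix $\begin{bmatrix} X & x^{(n+1)} \\ Y & 0 \end{bmatrix} = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(n)} & x^{(n+1)} \\ y^{(1)} & y^{(2)} & \cdots & y^{(n)} & 0 \end{bmatrix} \in \mathbb{R}^{(d+1) \times (n+1)}$ is generated by drawing $n + 1$ $d$-dimensional covariates $x^{(i)}$ and an in-context task vector $w$ representing a linear function $f_w : x \mapsto w^\top x$ and computing the target responses $y^{(i)}$ as follows: $x^{(i)} \overset{\text{i.i.d.}}{\sim} \mathcal{D}_X$, $w \sim \mathcal{D}_W$, $y^{(i)} = w^\top x^{(i)}$ $(i = 1, \dots, n + 1)$, where $x^{(i)}, w \in \mathbb{R}^d$, $y^{(i)} \in \mathbb{R}$, $[X \;\; x^{(n+1)}] \in \mathbb{R}^{d \times (n+1)}$, $X = [x^{(1)} \cdots x^{(n)}] \in \mathbb{R}^{d \times n}$, $[Y \;\; 0] \in \mathbb{R}^{1 \times (n+1)}$, $Y = [y^{(1)} \cdots y^{(n)}] \in \mathbb{R}^{1 \times n}$. Here, the $x^{(i)}$'s for $i \le n$ and $x^{(n+1)}$ are called the in-context covariates and the query input, respectively. |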
| Dataset Splits | No | The paper describes a synthetic data generation process for in-context linear regression, defining training and generalization risks based on expected values over distributions. It specifies parameters for this generation (e.g., n = 10, d = 5) and training details like batch size, but does not specify explicit train/test/validation splits for a fixed dataset, as the data is continuously sampled from defined distributions. |
| Hardware Specification | Yes | Answer: [No] Justification: Our random linear regression experiments do not require a lot of resources. We used a single A40 GPU, but much smaller one would suffice. |
| Software Dependencies | No | The paper states, "We modify the code from https://github.com/chengxiang/ Linear Transformer. See the supplemental material." However, it does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, etc.) used for the modifications or running the experiments. |
| Experiment Setup | Yes | Figure 2: Evolution of the two parameters, α and κ, for different b's. We train the two-parameter transformer using SGD with learning rate of 0.01 and batch size of 4,000. We also use n = 10, d = 5, and σ = 0.2, 0.4, 0.8, i.e., the task dispersion b = σ²d = 0.2, 0.8, 3.2 (from Left to Right). Figure 3: ICL loss curves for different parameterizations (each row) and different b's (each column). (...) We use learning rate of 0.01 and batch size of 4,000. We use SGD for the two-parameter transformer, but use AdamW for the full-parameter transformer and practical models. |
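
The data-generation process quoted under Open Datasets, and the task-dispersion arithmetic quoted under Experiment Setup, can be illustrated with a minimal NumPy sketch. This is not the author's code. The quoted text does not state the covariate and task distributions, so the sketch assumes standard normal covariates and a zero-mean Gaussian task vector with covariance σ²I; the paper's actual distributions may differ. The function names `sample_context` and `task_dispersion` are hypothetical.

```python
import numpy as np


def sample_context(n=10, d=5, sigma=0.2, rng=None):
    """Draw one (d+1) x (n+1) input context matrix and its query target.

    ASSUMPTION (not stated in the quoted text): covariates are N(0, I_d)
    and the task vector is N(0, sigma^2 I_d).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((d, n + 1))   # columns x^(1), ..., x^(n+1)
    w = sigma * rng.standard_normal(d)    # in-context task vector
    y = w @ x                             # y^(i) = w^T x^(i), shape (n+1,)
    target = y[-1]                        # response for the query input
    bottom = np.append(y[:-1], 0.0)       # query response is replaced by 0
    context = np.vstack([x, bottom[None, :]])
    return context, target, w


def task_dispersion(sigma, d):
    """Task dispersion b = sigma^2 * d, as in the quoted figure caption."""
    return sigma ** 2 * d


if __name__ == "__main__":
    context, target, w = sample_context(rng=np.random.default_rng(0))
    print(context.shape)
    print([task_dispersion(s, 5) for s in (0.2, 0.4, 0.8)])
```

With n = 10 and d = 5 the context matrix has shape (6, 11), and the three σ values in the caption give dispersions of approximately 0.2, 0.8, and 3.2. Because every batch is drawn fresh from these distributions, there is no fixed dataset to split, which is consistent with the "No" entries for Open Datasets and Dataset Splits.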