Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Demystifying Language Model Forgetting with Low-rank Example Associations

Authors: Xisen Jin, Xiang Ren

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we empirically analyze forgetting that occurs in N upstream examples of language modeling or instruction-tuning after fine-tuning LLMs on one of M new tasks, visualized in M × N matrices. We show that the matrices are often well-approximated with low-rank matrices, indicating the dominance of simple associations between the learned tasks and forgotten upstream examples. Leveraging the analysis, we predict forgetting of upstream examples when fine-tuning LLMs on unseen tasks with matrix completion over the empirical associations. This enables fast identification of most forgotten examples without expensive inference on the entire upstream data. Despite simplicity, the approach outperforms prior approaches that learn semantic relationships of learned tasks and upstream examples with LMs. We demonstrate the practical utility of our analysis by showing statistically significantly reduced forgetting as we upweight predicted examples for replay during fine-tuning.
Researcher Affiliation	Academia	Xisen Jin, Xiang Ren University of Southern California EMAIL
Pseudocode	No	The paper describes its methods and procedures in narrative text and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Please check out the attached supplementary material for the code and the forgetting statistics collected.
Open Datasets	Yes	Dataset and licenses. MMLU, BBH, and the Pile are released under MIT license. Truthful QA, Dolma, Redpajama, OLMo models, OLMo2 models, Pythia models, and MPT models are released under Apache 2.0 license. Tulu V2, OLMo2-Mix, and OLMo2-SFT-Mix are released under ODC-By license. Dolly is released under CC BY-SA 3.0 license.
Dataset Splits	Yes	To evaluate this, we create training and test splits by partitioning the set of fine-tuning tasks (noted as T_train and T_test) and the rows of the association matrices Z. We further control whether T_train and T_test belong to the same category of tasks to test both in-domain and out-of-domain generalization ability of the prediction models. For OLMo-1B and 7B experiments, we use FLAN as in-domain tasks and Tulu and Dolly as out-of-domain testing tasks. For OLMo-7B-Instruct experiments, we use MMLU, BBH, OLMo2-SFT-Mix as in-domain tasks and use Truthful QA and Dolly as out-of-domain testing tasks. Details about the tasks included in the training, in-domain testing, and out-of-domain testing sets are discussed in Tables 15 and 16 in Appendix D. ... For in-domain test splits, we randomly sample 30 upstream examples and assume the ground truth forgetting is known for these examples. This is required for predicting forgetting on the rest of upstream examples by additive linear, MF, and KNN methods. We repeat the experiment 10 times.
Hardware Specification	Yes	Computational Infrastructure. We used 4 Quadro RTX A6000 GPUs for fine-tuning LLMs, and used 1 Quadro RTX A6000 GPU for LLM inference.
Software Dependencies	No	We use Hugging Face Transformers library for training and VLLM library for efficient inference. The paper mentions software libraries used but does not provide specific version numbers for these components, which is required for a reproducible description.
Experiment Setup	Yes	For full-parameter fine-tuning of non-instruction-tuned LLMs of all types, we train the model for 1,000 steps with an effective batch size of 8 and a linearly decaying learning rate of 2e-6. The learning rate is chosen among {1e-6, 2e-6, 5e-6} that achieves the best average validation perplexity after fine-tuning OLMo-7B on 5 randomly chosen tasks from FLAN. For OLMo-7B-Instruct and MMLU, BBH, Truthful QA and Dolly, considering the small size of the training sets, we train the models only for 100 steps with an effective batch size of 8. For OLMo2-SFT-Mix tasks, we train the model for 1,000 steps.
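
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. The quoted text gives only the peak rate (2e-6), the step counts (1,000 or 100), and that the decay is linear; the absence of warmup, the decay to exactly zero at the final step, and the function name `linear_decay_lr` are assumptions made here for illustration, not details confirmed by the paper.

```python
def linear_decay_lr(step, total_steps=1000, peak_lr=2e-6):
    """Learning rate at `step` under a linear decay from `peak_lr` to 0.

    Assumptions (not stated in the quoted setup): no warmup phase, and
    the rate reaches exactly 0 at `total_steps`.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    # Fraction of training still remaining, from 1.0 down to 0.0.
    remaining = 1.0 - step / total_steps
    return peak_lr * remaining


# Hypothetical usage: the 1,000-step setting and the shorter 100-step setting.
long_run = [linear_decay_lr(s) for s in (0, 500, 1000)]
short_run = [linear_decay_lr(s, total_steps=100) for s in (0, 50, 100)]
```

Under these assumptions the two settings share the same peak rate and differ only in how quickly it decays; any warmup phase or nonzero final rate used by the authors would change the curve.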