Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning

Authors: Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung (Vivian) Chen, Shao-Hua Sun, Hung-yi Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across different model families and scales, including Gemma 2 IT 2B, Llama 3 8B Instruct, and three additional models, agree with our findings. To the best of our knowledge, this is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning, offering valuable insights for developing more robust fine-tuning strategies.
Researcher Affiliation	Collaboration	1 Appier AI Research, 2 National Taiwan University
Pseudocode	No	The paper describes methods and strategies but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Our code and datasets are available at https://github.com/appier-research/robust-llm-finetunes
Open Datasets	Yes	For target-task fine-tuning, we adopt the MBPP and MATH datasets. These datasets provide rich annotations, including full solutions, reasoning steps, and test cases, beyond just the final answers, allowing models to learn comprehensive task-solving behaviors. For evaluation, in addition to using the test sets from MBPP and MATH, we assess model performance on GSM8K [7], ARC-Challenge [6], and BIRD [19] to examine generalization to non-target tasks involving various forms of reasoning and generation. A detailed description of the datasets is provided in Section 4. Our code and datasets are available at https://github.com/appier-research/robust-llm-finetunes.
Dataset Splits	Yes	MBPP: A Python programming benchmark containing 974 problem-solution pairs. We partition the training set into 374 train and 90 validation examples and use the original 378 test examples. Performance is evaluated using the pass@1 metric. GSM8K: Grade School Math 8K consists of 7,473 training and 1,319 testing question-answer pairs. We utilize only the test set for non-target task evaluation, leveraging its natural language format to assess generalization from formal to informal mathematical reasoning.
Hardware Specification	Yes	We train Llama 3 8B Instruct, Mistral 7B Instruct and Gemma 2 IT 2B using two NVIDIA A100 Tensor Core GPU with VRAM 40GB for each GPU and additional RAM of 96GB.
Software Dependencies	No	The paper mentions various LLM models used (e.g., Llama 3 8B Instruct, Gemma 2 IT 2B) and references to their original papers, but does not explicitly list specific software dependencies (e.g., Python, PyTorch, CUDA versions) used for the experiments.
Experiment Setup	Yes	Following recent findings [4] that LoRA [13] effectively mitigates performance degradation, we standardize our experiments by using LoRA for all fine-tuning. We then vary training strategies and data sources to study their impact on performance preservation. Details of our evaluation prompts and training configurations are provided in Appendix F. Table 5: Learning rate (lr) sweeping results of training Llama 3 8B Instruct (Llama 3 8B-IT) and Gemma 2 IT 2B on the MBPP dataset.
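
The Dataset Splits row notes that MBPP performance is evaluated with the pass@1 metric. As a rough illustration of what that metric measures, the sketch below uses the standard unbiased pass@k estimator from the code-generation literature, which for k = 1 reduces to the fraction of sampled solutions that pass all test cases. This is an assumption about the metric's usual definition, not the paper's own evaluation script; the function names and the per-problem sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that pass every test case
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: any k-subset contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes: one sample per problem, two of four problems solved.
example_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
example_score = mean_pass_at_k(example_results, k=1)
```

With a single sample per problem, pass@1 is simply the share of problems solved on the first attempt. Drawing more samples per problem (n > 1) only lowers the variance of the estimate, since for k = 1 the estimator equals c / n.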