Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

On the Loss of Context Awareness in General Instruction Fine-tuning

Authors: Yihan Wang, Andrew Bai, Nanyun Peng, Cho-Jui Hsieh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on four context-dependent downstream tasks and three pre-trained LLMs of different sizes show that our method effectively mitigates the loss of context awareness without compromising general instruction-following capabilities.
Researcher Affiliation	Academia	Yihan Wang (UCLA), Andrew Bai (UCLA), Nanyun Peng (UCLA), Cho-Jui Hsieh (UCLA)
Pseudocode	No	The paper describes mathematical formulas and procedural steps in text (e.g., Section 2.3 and 3.1) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/YihanWang617/context_awareness.
Open Datasets	Yes	We experiment with three popular open-source instruction fine-tuning datasets: ShareGPT, adopted by Vicuna [4], UltraChat-200k [7], and WizardLM-70K [25]. ... In addition to NIH, we report the performance on three closed-book QA tasks to benchmark context awareness: SQuAD [21], QuAC [5], and DROP [8].
Dataset Splits	No	The paper mentions fine-tuning models on the instruction fine-tuning datasets (ShareGPT, UltraChat-200k, WizardLM-70K), e.g., "fine-tune the models for one epoch on ShareGPT and UltraChat-200K", but does not specify how the authors split these datasets into training, validation, or test sets. For the NIH evaluation, it specifies "400 NIH tests with different insertion locations and context lengths", but this describes the evaluation setup, not a training split.
Hardware Specification	Yes	All experiments are conducted on 4 A6000 GPUs on a local server.
Software Dependencies	No	The paper mentions using "fine-tuning recipes from the Huggingface alignment-handbook" and "the fine-tuning recipe provided by the author" for TinyLlama, but does not provide specific version numbers for these frameworks or for any other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup	Yes	Detailed hyperparameters can be found in Appendix A.1.1. ... Table 4: Fine-tuning hyperparameters configuration (columns: Models | Fine-tune config | Learning rate | Batch size | Precision). TinyLlama | Full fine-tune | 2e-5 | 128 | bf16. Llama-2/3 | QLoRA with rank = 16, alpha = 16 | 2e-4 | 64 | bf16. ... We set the threshold for context-awareness as β = 0.6 for all experiments reported in Table 3.
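
As a reading aid, the hyperparameters quoted in the Experiment Setup row can be collected into a small Python structure. This is a minimal sketch, not code from the paper or its repository: the class name `FineTuneConfig`, its field names, the dictionary keys, and the helper `describe` are all hypothetical. Only the numeric values (learning rates, batch sizes, precision, QLoRA rank and alpha, and β = 0.6) are taken from the quoted Table 4 excerpt and the threshold sentence above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for one row of the quoted Table 4."""
    model: str
    method: str                      # "full" or "qlora" (labels are assumptions)
    learning_rate: float
    batch_size: int
    precision: str
    lora_rank: Optional[int] = None  # only set for QLoRA rows
    lora_alpha: Optional[int] = None


# Values transcribed from the Table 4 excerpt in the Experiment Setup row.
CONFIGS = {
    "TinyLlama": FineTuneConfig("TinyLlama", "full", 2e-5, 128, "bf16"),
    "Llama-2/3": FineTuneConfig(
        "Llama-2/3", "qlora", 2e-4, 64, "bf16", lora_rank=16, lora_alpha=16
    ),
}

# Context-awareness threshold (beta) reported for the Table 3 experiments.
CONTEXT_AWARENESS_THRESHOLD = 0.6


def describe(name: str) -> str:
    """Return a one-line, human-readable summary of a configuration."""
    cfg = CONFIGS[name]
    summary = (
        f"{cfg.model}: {cfg.method}, lr={cfg.learning_rate}, "
        f"batch={cfg.batch_size}, precision={cfg.precision}"
    )
    if cfg.lora_rank is not None:
        summary += f", rank={cfg.lora_rank}, alpha={cfg.lora_alpha}"
    return summary


if __name__ == "__main__":
    for model_name in CONFIGS:
        print(describe(model_name))
```

Keeping the values in one frozen structure makes it easier to compare them against Appendix A.1.1 of the paper, which remains the authoritative source for the full configuration.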