Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance
Authors: Sachin Goyal, Christina Baek, Zico Kolter, Aditi Raghunathan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | However, we observe a surprising failure mode: during instruction tuning, the context reliance under knowledge conflicts initially increases as expected, but then gradually decreases as instruction finetuning progresses. ... We perform various controlled studies and theoretical analysis to show that context-parametric inversion occurs... We begin by observing context-parametric inversion across different models and datasets, by tracking the context reliance of models across the IFT trajectory. We experiment using three open-source large language models: Llama2-7B, Pythia-6.9B, and Mistral-7B. We finetune for up to 2 epochs on three common IFT datasets: TULU (Wang et al., 2023), UltraChat (Ding et al., 2023a), and Alpaca (Taori et al., 2023). |
| Researcher Affiliation | Academia | Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper presents mathematical equations and derivations in Section 5, "THEORETICAL ANALYSIS OF CONTEXT-VS-PARAMETRIC RELIANCE" (e.g., Equations 1, 4, and 5), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | All datasets are available at https://github.com/locuslab/context-parametric-inversion. This link refers to the datasets, not the source code for the experimental methodology described in the paper. There is no explicit statement about releasing the code for their methodology. |
| Open Datasets | Yes | We experiment using three open-source large language models: Llama2-7B, Pythia-6.9B, and Mistral-7B. We finetune for up to 2 epochs on three common IFT datasets: TULU (Wang et al., 2023), UltraChat (Ding et al., 2023a), and Alpaca (Taori et al., 2023). All datasets are available at https://github.com/locuslab/context-parametric-inversion. We track the progress of IFT based on the performance on four standard benchmarks: GSM8k (Cobbe et al., 2021) (math), MMLU (Hendrycks et al., 2021) (general fact recall), SQuAD (Rajpurkar et al., 2016) (reading comprehension), and ARC-Challenge (Clark et al., 2018) (reasoning). |
| Dataset Splits | No | The paper mentions 'We finetune for up to 2 epochs on three common IFT datasets', 'we evaluated every 50 steps on the knowledge conflict datasets introduced earlier', and 'For tracking finetuning progress, we use the average performance across four standard benchmarks'. It also mentions filtering 25% of Alpaca datapoints in Section 4.4, but it does not specify exact train/validation/test splits, percentages, or absolute sample counts for any dataset used in the experiments. |
| Hardware Specification | No | The paper does not contain specific details about the hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | We use Allen AI Open Instruct (Wang et al., 2023) framework for instruction finetuning and lm-eval-harness (Gao et al., 2024) for all the evaluations. Specific version numbers for these frameworks or other core software libraries (like Python, PyTorch/TensorFlow, CUDA) are not provided. |
| Experiment Setup | Yes | We finetune for up to 2 epochs on three common IFT datasets... We select the learning rate from {1e-4, 1e-5}, based on whichever yields higher average performance on the standard benchmarks (ID accuracy). Unless otherwise specified, we use LoRA with rank 128 for SFT. |
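The Experiment Setup row quotes LoRA finetuning at rank 128. As a hedged illustration only (not the authors' code, which the table above notes is not released), the low-rank update at the heart of LoRA can be sketched in numpy. The rank `r=128` matches the paper's quoted setting; the layer width `d`, the scaling `alpha`, and all initializations are placeholder assumptions.

```python
import numpy as np

# Sketch of the LoRA adapted weight: W' = W + (alpha / r) * B @ A.
# r=128 matches the rank quoted in the paper; d and alpha are
# illustrative assumptions, not values taken from the paper.
rng = np.random.default_rng(0)
d, r, alpha = 512, 128, 128

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass through a linear layer with a LoRA adapter added
    to the frozen weight; only A and B would be updated during SFT."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((4, d))
# With B zero-initialized, the adapted layer equals the frozen one,
# so finetuning starts exactly at the pretrained model.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```

Zero-initializing `B` is the standard LoRA choice: it guarantees the adapter contributes nothing at step 0, which is also why context reliance can be tracked cleanly from the start of the IFT trajectory.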