Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance
Authors: Sachin Goyal, Christina Baek, Zico Kolter, Aditi Raghunathan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | However, we observe a surprising failure mode: during instruction tuning, the context reliance under knowledge conflicts initially increases as expected, but then gradually decreases as instruction finetuning progresses. ... We perform various controlled studies and theoretical analysis to show that context-parametric inversion occurs... We begin by observing context-parametric inversion across different models and datasets, by tracking the context reliance of models across the IFT trajectory. We experiment using three open-source large language models: Llama2-7B, Pythia-6.9B, and Mistral-7B. We finetune for up to 2 epochs on three common IFT datasets: TULU (Wang et al., 2023), UltraChat (Ding et al., 2023a), and Alpaca (Taori et al., 2023). |
| Researcher Affiliation | Academia | Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper presents mathematical equations and derivations in Section 5, "THEORETICAL ANALYSIS OF CONTEXT-VS-PARAMETRIC RELIANCE" (e.g., Equations 1, 4, and 5), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | All datasets are available at https://github.com/locuslab/context-parametric-inversion. This link refers to the datasets, not the source code for the experimental methodology described in the paper. There is no explicit statement about releasing the code for their methodology. |
| Open Datasets | Yes | We experiment using three open-source large language models: Llama2-7B, Pythia-6.9B, and Mistral-7B. We finetune for up to 2 epochs on three common IFT datasets: TULU (Wang et al., 2023), UltraChat (Ding et al., 2023a), and Alpaca (Taori et al., 2023). All datasets are available at https://github.com/locuslab/context-parametric-inversion. We track the progress of IFT based on the performance on four standard benchmarks: GSM8k (Cobbe et al., 2021) (math), MMLU (Hendrycks et al., 2021) (general fact recall), SQuAD (Rajpurkar et al., 2016) (reading comprehension), and ARC-Challenge (Clark et al., 2018) (reasoning). |
| Dataset Splits | No | The paper mentions 'We finetune for up to 2 epochs on three common IFT datasets', 'we evaluated every 50 steps on the knowledge conflict datasets introduced earlier', and 'For tracking finetuning progress, we use the average performance across four standard benchmarks'. It also mentions filtering 25% of Alpaca datapoints in Section 4.4, but it does not specify exact train/validation/test splits, percentages, or absolute sample counts for any dataset used in the experiments. |
| Hardware Specification | No | The paper does not contain specific details about the hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | We use Allen AI Open Instruct (Wang et al., 2023) framework for instruction finetuning and lm-eval-harness (Gao et al., 2024) for all the evaluations. Specific version numbers for these frameworks or other core software libraries (like Python, PyTorch/TensorFlow, CUDA) are not provided. |
| Experiment Setup | Yes | We finetune for up to 2 epochs on three common IFT datasets... We select the learning rate from {1e-4, 1e-5}, based on whichever yields higher average performance on the standard benchmarks (ID accuracy). Unless otherwise specified, we use LoRA with rank 128 for SFT. |
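The Experiment Setup row quotes LoRA finetuning at rank 128. As a hedged illustration only (not the authors' code, which the table above notes is not released), the low-rank update at the heart of LoRA can be sketched in numpy. The rank `r=128` matches the paper's quoted setting; the layer width `d`, the scaling `alpha`, and all initializations are placeholder assumptions.

```python
import numpy as np

# Sketch of the LoRA adapted weight: W' = W + (alpha / r) * B @ A.
# r=128 matches the rank quoted in the paper; d and alpha are
# illustrative assumptions, not values taken from the paper.
rng = np.random.default_rng(0)
d, r, alpha = 512, 128, 128

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass through a linear layer with a LoRA adapter added
    to the frozen weight; only A and B would be updated during SFT."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((4, d))
# With B zero-initialized, the adapted layer equals the frozen one,
# so finetuning starts exactly at the pretrained model.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```

Zero-initializing `B` is the standard LoRA choice: it guarantees the adapter contributes nothing at step 0, which is also why context reliance can be tracked cleanly from the start of the IFT trajectory.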