Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Controllable Context Sensitivity and the Knob Behind It
Authors: Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When fine-tuned on this task, instruct versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. |
| Researcher Affiliation | Academia | ETH Zürich, EPFL, Cornell University |
| Pseudocode | Yes | We provide Python-esque pseudocode for our search algorithm in App. A.1. Listing 1: Search Algorithm. |
| Open Source Code | Yes | We provide code to reproduce all datasets, experiments, and analysis at https://github.com/kdu4108/context-vs-prior-finetuning. |
| Open Datasets | Yes | Following the task formulation in §3.1, we construct intent-augmented datasets, CCS-BF, CCS-MH, and CCS-AR, based on the query-context pairs in BASEFAKEPEDIA, MULTIHOPFAKEPEDIA (Monea et al., 2024), and ARITHMETIC. |
| Dataset Splits | Yes | Let $S_{\text{trn}} \subseteq Q \times C$ and $S_{\text{tst}} \subseteq Q \times C$ be disjoint training and testing sets of query-context pairs. Models are trained on $F(q, c, \text{pri}) \circ a(q, \varepsilon)$ and $F(q, c, \text{ctx}) \circ a(q, c)$ for $(q, c) \in S_{\text{trn}}$, where $\circ$ denotes concatenation. ... Training set size: 2048 examples. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | We build on pyvene (Wu et al., 2024) to train the projection. ... apply PyTorch's orthogonal parametrization to enforce orthonormal columns in A. The paper mentions `pyvene` and `PyTorch` but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | To fine-tune models in the CCS-BF task, we use QLoRA with the following hyperparameters: Effective batch size (after gradient accumulation): 16; Optimizer: AdamW (8-bit); Learning rate: 2e-4; QLoRA hyperparameters: attention head projection matrices in all layers; Training set size: 2048 examples. |
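
The Dataset Splits row describes training on a formatted prompt concatenated with an answer, once per intent, over a set of query-context pairs that is disjoint from the test set. The following is a minimal sketch of that construction, not the authors' code. The prompt template in `format_prompt`, the helper names, and the toy capital-city data are all assumptions made for illustration; the paper's actual template and datasets live in the linked repository.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    prompt: str  # plays the role of F(q, c, intent)
    target: str  # a(q, eps) for intent "pri", a(q, c) for intent "ctx"


def format_prompt(query: str, context: str, intent: str) -> str:
    """Hypothetical stand-in for F; the real template is not given in this summary."""
    return f"Context: {context}\nIntent: {intent}\nQuery: {query}\nAnswer: "


def split_pairs(pairs, n_train, seed=0):
    """Shuffle (query, context) pairs and cut them into disjoint train/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def build_examples(pairs, answer):
    """Emit one 'pri' and one 'ctx' example per pair.

    The training string for each example is prompt + target (concatenation).
    """
    examples = []
    for q, c in pairs:
        examples.append(Example(format_prompt(q, c, "pri"), answer(q, None)))
        examples.append(Example(format_prompt(q, c, "ctx"), answer(q, c)))
    return examples


# Toy data, invented for illustration only.
PRIOR = {
    "Capital of France?": "Paris",
    "Capital of Italy?": "Rome",
    "Capital of Japan?": "Tokyo",
}
CONTEXT_ANSWER = {
    "Capital of France?": "Lyon",
    "Capital of Italy?": "Milan",
    "Capital of Japan?": "Osaka",
}
pairs = [(q, f"The capital is {CONTEXT_ANSWER[q]}.") for q in PRIOR]


def toy_answer(query, context):
    """Return the prior answer when no context is given, else the context answer."""
    return PRIOR[query] if context is None else CONTEXT_ANSWER[query]


train_pairs, test_pairs = split_pairs(pairs, n_train=2)
train_examples = build_examples(train_pairs, toy_answer)
```

Splitting at the level of query-context pairs, before expanding each pair into its two intent variants, keeps both variants of a pair on the same side of the split, which is what makes the training and test sets disjoint in the sense quoted above.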