Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

Authors: Sahar Abdelnabi, Ahmed Salem

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present the first quantitative study of how such test awareness impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.
Researcher Affiliation	Collaboration	Sahar Abdelnabi (1, 2, 3), Ahmed Salem (1). Affiliations: (1) Microsoft; (2) ELLIS Institute Tübingen & MPI for Intelligent Systems; (3) Tübingen AI Center.
Pseudocode	Yes	Algorithm 1: Training Test Awareness Probe; Algorithm 2: Steering via Parameter Editing
Open Source Code	Yes	Code is available at: https://github.com/microsoft/Test_Awareness_Steering.
Open Datasets	Yes	We construct a synthetic dataset of single-turn tasks... We publicly release our codebase and synthetic datasets to enable further research. ... Using the trained probes, we apply steering to other held-out data: (1) the test split of our synthetic trigger dataset; (2) a random subset of HarmBench [Mazeika et al., 2024]; (3) a one-turn action-evaluation dataset that we construct to evaluate actions such as invoking harmful tools; (4) the BBQ dataset [Parrish et al., 2022] to test other safety impacts, beyond agentic applications, such as conforming to stereotypes; and (5) general benchmarks for assessing reasoning and commonsense capabilities such as MMLU [Hendrycks et al., 2021], HellaSwag [Zellers et al., 2019], and CommonsenseQA [Talmor et al., 2019].
Dataset Splits	Yes	Probe data. Building on the synthetic trigger dataset (Section 3.1) and the LLM judge's annotations (Section 3.2), we extract and label evidence snippets for evaluation awareness for each model. These snippets are then split into disjoint training and test sets for each model: 495/247 examples per class for QwQ, 376/187 for Qwen3, and 215/107 for Distill-Qwen.
Hardware Specification	Yes	We perform all steering experiments on one machine with 4 A100 GPUs.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers.
Experiment Setup	Yes	Probe training. We split the dataset into training and test sets by prompt (rather than by snippet). For each layer l, we fit a linear classifier: $P(y = \text{aware} \mid T_l) = \mathrm{Softmax}(M_l T_l)$, $M_l \in \mathbb{R}^{2 \times d}$. ... We train a separate linear probe on each model's annotated evidence snippets, using cross-entropy loss with SGD (learning rate = 0.008, momentum = 0.9) for up to 300 epochs. ... For steering to the test awareness class, we used α < 0.07, and for test unawareness, α > 0.35. We set top-K to 800.
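
The probe setup quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation: it is a pure-Python toy that assumes random 4-dimensional vectors in place of real layer activations, per-example SGD updates, a fixed 300 epochs with no early stopping, and a held-out fraction of 0.33 chosen only to roughly mirror the 2:1 split sizes reported above. The names `split_by_prompt`, `train_probe`, and `predict` are hypothetical. Only the classifier form (softmax over a 2 x d matrix without bias), the cross-entropy loss, the learning rate, the momentum, and the split-by-prompt rule come from the quoted text.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def split_by_prompt(examples, test_frac=0.33, seed=0):
    """Split (prompt_id, features, label) triples so no prompt is in both sets."""
    prompt_ids = sorted({pid for pid, _, _ in examples})
    random.Random(seed).shuffle(prompt_ids)
    n_test = max(1, int(len(prompt_ids) * test_frac))
    test_ids = set(prompt_ids[:n_test])
    train = [ex for ex in examples if ex[0] not in test_ids]
    test = [ex for ex in examples if ex[0] in test_ids]
    return train, test

def train_probe(train, dim, lr=0.008, momentum=0.9, epochs=300, seed=0):
    """Fit M (2 x dim) by cross-entropy loss using SGD with momentum."""
    rng = random.Random(seed)
    M = [[0.0] * dim for _ in range(2)]
    V = [[0.0] * dim for _ in range(2)]  # momentum buffer, same shape as M
    data = list(train)
    for _ in range(epochs):
        rng.shuffle(data)
        for _, x, y in data:
            p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in M])
            for c in range(2):
                g = p[c] - (1.0 if c == y else 0.0)  # d(loss)/d(logit_c)
                for j in range(dim):
                    V[c][j] = momentum * V[c][j] + g * x[j]
                    M[c][j] -= lr * V[c][j]
    return M

def predict(M, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in M]
    return 0 if logits[0] >= logits[1] else 1

# Toy data (assumption): 30 prompts, 2 snippets each; label 1 stands for "aware".
rng = random.Random(1)
examples = []
for pid in range(30):
    y = pid % 2
    for _ in range(2):
        x = [rng.gauss(1.0 if y else -1.0, 0.2)]
        x += [rng.gauss(0.0, 0.2) for _ in range(3)]
        examples.append((pid, x, y))

train, test = split_by_prompt(examples)
M = train_probe(train, dim=4)
acc = sum(predict(M, x) == y for _, x, y in test) / len(test)
print(f"held-out accuracy: {acc:.2f}")
```

Splitting on prompt IDs rather than on individual snippets keeps snippets from the same prompt out of both sets at once, which is the point of the "by prompt (rather than by snippet)" remark in the quoted text.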