Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Detecting High-Stakes Interactions with Activation Probes

Authors: Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep S Lubana, Dmitrii Krasheninnikov

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probe performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems.
Researcher Affiliation Collaboration 1 LASR Labs, 2 University College London, 3 MILA, 4 Harvard University, 5 NTT Research, 6 Goodfire, 7 University of Cambridge
Pseudocode No The paper only provides mathematical formulas for probe types (Table 3) and describes methods in prose, without structured pseudocode or algorithm blocks.
Open Source Code Yes We release our novel synthetic dataset and the codebase at https://github.com/arrrlex/models-under-pressure.
Open Datasets Yes We release our novel synthetic dataset and the codebase at https://github.com/arrrlex/models-under-pressure. For evaluation, we curated several external datasets, labelled for high/low-stakes; these external datasets help us explore the effectiveness of our high-stakes probes in realistic deployment scenarios. Table 2: Overview of evaluation datasets. We take datasets of human generated prompts from a variety of realistic deployment settings and label each dataset as high- or low-stakes using GPT-4o. As the data is human generated the data is OOD from the synthetic generated training dataset. We split the labeled examples into a development and test split, and subsample such that each consists of 50% high-stakes and 50% low-stakes samples. Details of our labelling process and further background about the datasets can be found in Appendix C.3.
Dataset Name | Types of Stakes | Synthetic / Natural | Dev / Test | Source
Anthropic HH-RLHF | harm to others | natural | 1028 / 2984 | Bai et al. (2022)
Tool ACE | various | synthetic | 328 / 734 | Liu et al. (2024)
MT Samples | medical | natural | 278 / 604 | Boyle (2018)
MTS Dialog | medical | natural | 274 / 86 | Ben Abacha et al. (2023)
Mental Health | harm to others, political | natural | 540 | Sarkar (2023)
Aya Redteaming | various | natural | 1513 | Aakanksha et al. (2024)
Dataset Splits Yes Table 2: Overview of evaluation datasets. ... We split the labeled examples into a development and test split, and subsample such that each consists of 50% high-stakes and 50% low-stakes samples. Details of our labelling process and further background about the datasets can be found in Appendix C.3.
Hardware Specification Yes For computing activations, a machine with 2x H100 GPUs (80GB each) and 400GB of RAM was used. For training and evaluating probes with pre-computed activations, a machine with 1x A100 (40GB) and 200GB of RAM was sufficient. For finetuning baselines up to 8B parameters, a machine with 1x H100 (80GB) was used. Gemma3-12B was finetuned on a machine with 1x H100 NVL. Prompted baselines up to Gemma3-27B can be run on a single H100; for the Llama-3.3-70B prompted baseline we used a machine with 2x H100 (80GB each).
Software Dependencies No The paper mentions optimizers like AdamW and AdamW8Bit, and specific LLMs (Llama-3, Gemma-3 families), but does not provide specific version numbers for software libraries such as Python, PyTorch, or TensorFlow, which are necessary for full reproducibility.
Experiment Setup Yes Figure 8: Hyperparameters for various probes and finetuned baselines. This table provides detailed settings for batch size, epochs, learning rates, optimizers, and other critical parameters for both probe training and finetuned baselines.
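
Illustrative sketch: The Dataset Splits entry above quotes a two-step procedure: split the labeled examples into development and test sets, then subsample each set to 50% high-stakes and 50% low-stakes. The Python below is one minimal way such a procedure could look, not the authors' implementation. The function names, the "high"/"low" label strings, the dev_fraction value, and the seed are all assumptions made for illustration; the paper's actual split sizes are those listed in Table 2.

```python
import random
from collections import defaultdict


def balance(split, rng):
    """Subsample a split so it holds equal numbers of "high" and "low" examples."""
    by_label = defaultdict(list)
    for text, label in split:
        by_label[label].append((text, label))
    # Keep as many examples per class as the smaller class provides.
    n = min(len(by_label["high"]), len(by_label["low"]))
    balanced = rng.sample(by_label["high"], n) + rng.sample(by_label["low"], n)
    rng.shuffle(balanced)
    return balanced


def balanced_split(examples, dev_fraction=0.3, seed=0):
    """Split (text, label) pairs into dev/test, then balance each split 50/50.

    dev_fraction and seed are hypothetical defaults, not values from the paper.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    dev, test = shuffled[:cut], shuffled[cut:]
    return balance(dev, rng), balance(test, rng)
```

Splitting before balancing keeps the development and test sets disjoint, and a fixed seed makes the subsample repeatable across runs.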