Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Combining Cost Constrained Runtime Monitors for AI Safety

Authors: Tim Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical demonstrations of significant performance gains compared to baselines. We empirically evaluate our framework by implementing a system to detect backdoored code in AI-generated solutions to programming problems (Greenblatt et al., 2024). Our experiments reveal that: (1) likelihood-ratio-based policies outperform naive baselines; (2) stochastic monitor calls can significantly outperform always-on monitoring by strategically trading off spending on monitors against spending on the safety intervention; and (3) combining two monitors (Llama 3.1 70B and 8B) yields Pareto improvements over any single-monitor policy, as shown in Figure 1.
Researcher Affiliation	Collaboration	Tim Tian Hua (MARS); James Baskerville (MARS); Henri Lemoine (MARS); Mia Hopman (MARS); Aryan Bhatt (Redwood Research); Tyler Tracy (MARS & Redwood Research). This work was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub (caish.org/mars).
Pseudocode	Yes	Algorithm 1 Finding optimal parameters for ΠSR policy
Open Source Code	Yes	our code can be found at this Github link: github.com/tim-hua01/combining_monitors_public.
Open Datasets	Yes	We prompt o1-mini to solve all 1,140 coding questions from the BigCodeBench dataset (Zhuo et al., 2025).
Dataset Splits	Yes	Unless otherwise noted, we average results over 500 random train/test splits of misaligned scores, with 218 in the train split and 100 samples in the test split to account for statistical uncertainty.
Hardware Specification	Yes	Hardware Specifications CPU: AMD EPYC 7413 24-Core Processor; CPU Configuration: 2 sockets × 24 cores per socket × 2 threads per core = 96 total CPU threads; CPU Max Frequency: 3.6 GHz; Memory: 754 GB RAM; NUMA Configuration: 8 NUMA nodes; L1: 1.5 MiB data + 1.5 MiB instruction (48 instances); L2: 24 MiB (48 instances); L3: 256 MiB (8 instances)
Software Dependencies	No	The paper mentions 'scikit-learn' for kernel density estimation but does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup	Yes	We sample each monitor five times, average the scores, and add minimal Gaussian noise to make the score distributions more continuous. Running the monitors on aligned outputs generates our aligned monitor distribution (and vice versa for misaligned outputs). We normalize Llama 3.1 70B to cost 1 unit per sample while Llama 3.1 8B costs 0.21 units, reflecting their relative real-world inference costs as monitors. Unless otherwise specified, we assume that auditing costs 20 times as much as a single call to the Llama 3.1 70B monitor. We fit two separate kernel density estimators for aligned and misaligned outputs. For a given monitor score, its log likelihood ratio is the log of its likelihood under the misaligned distribution minus the log of its likelihood under the aligned distribution.
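
The scoring steps quoted in the Experiment Setup row (average five monitor samples, add a little Gaussian noise, fit one kernel density estimator per class, take the difference of log likelihoods) can be illustrated with a minimal sketch. This is not the authors' code: the paper reports using scikit-learn for kernel density estimation, whereas the sketch below substitutes a hand-written Gaussian KDE so that it runs with the standard library alone. The function names, the bandwidth, the noise scale, and the synthetic scores are all assumptions made for illustration.

```python
import math
import random
from statistics import fmean


def averaged_score(raw_samples, rng, noise_sd=0.01):
    """Average repeated monitor samples and add small Gaussian noise.

    noise_sd is a hypothetical value; the source only says "minimal" noise.
    """
    return fmean(raw_samples) + rng.gauss(0.0, noise_sd)


def gaussian_kde_logpdf(x, samples, bandwidth):
    """Log density at x of a Gaussian kernel density estimate over samples."""
    # Log-sum-exp over the per-sample kernels for numerical stability.
    log_terms = [-0.5 * ((x - s) / bandwidth) ** 2 for s in samples]
    peak = max(log_terms)
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    log_norm = math.log(len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return log_sum - log_norm


def log_likelihood_ratio(score, misaligned, aligned, bandwidth=0.5):
    """Log likelihood under the misaligned KDE minus that under the aligned KDE."""
    return (gaussian_kde_logpdf(score, misaligned, bandwidth)
            - gaussian_kde_logpdf(score, aligned, bandwidth))


# Synthetic monitor scores for illustration only (not data from the paper).
rng = random.Random(0)
aligned = [
    averaged_score([rng.gauss(2.0, 1.0) for _ in range(5)], rng)
    for _ in range(300)
]
misaligned = [
    averaged_score([rng.gauss(6.0, 1.0) for _ in range(5)], rng)
    for _ in range(300)
]

high = log_likelihood_ratio(7.0, misaligned, aligned)
low = log_likelihood_ratio(1.0, misaligned, aligned)
print(f"LLR at score 7.0: {high:.2f}")
print(f"LLR at score 1.0: {low:.2f}")
```

A positive log likelihood ratio means the score is more typical of misaligned outputs than of aligned ones, and a negative value means the opposite. In a real pipeline the bandwidth would need to be chosen with care, since it controls how smooth each estimated density is.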