Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Authors: Benjamin Arnav, Pablo Bernabeu-Perez, Nathan Helm-Burger, Timothy H. Kostolansky, Hannes Whittingham, Mary Phuong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases.
Researcher Affiliation Industry Benjamin Arnav 1 Pablo Bernabeu-Pérez 1 Nathan Helm-Burger 1 Tim Kostolansky 1 Hannes Whittingham 1 Mary Phuong 1 LASR Labs *Core contributors. All correspondence to EMAIL.
Pseudocode No The paper describes methodologies and evaluation setups but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code is available at: https://github.com/cotredhanded/code.
Open Datasets Yes We use problems from BigCodeBench (Zhuo et al., 2024) as our main task dataset. BigCodeBench is a benchmark for code generation that contains 1,140 function-level tasks requiring a combination of multiple function calls from 139 Python libraries. We use the Instruct variant of the dataset, which presents the coding problems as short natural language instructions.
Dataset Splits Yes We split the BigCodeBench dataset into four partitions: Prompt, Train, Validation and Test, following 5%/10%/10%/75% splits respectively.
Hardware Specification Yes The paper reports estimated compute usage in Appendix D, including the number of inference calls, APIs used (Anthropic, Google, Groq, OpenAI, Together), model configurations, and compute requirements for a targeted fine-tuning experiment with DeepSeek-R1-Distill-Qwen-14B to simulate advanced attack policies.
Software Dependencies No The paper mentions using the "Inspect AI framework (UK AI Safety Institute, 2024)" and lists various models from providers like Anthropic, DeepMind, OpenAI, DeepSeek-AI, and xAI. However, it does not specify software dependency versions (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup Yes To optimize the hybrid monitor's performance, we evaluate weighted combinations of action-only and CoT suspicion scores. For each weight $w \in [0, 1]$ at regular intervals of 0.01, hybrid scores were computed as: $s_{\text{hybrid}} = w \cdot s_{\text{action-only}} + (1 - w) \cdot s_{\text{CoT}}$. For each $w$, we determine the 99th percentile threshold using benign examples (main-task-only completions, with no side instruction) to maintain a 1% FPR. We then calculate the detection rate for each weighting using the baseline attack policy. The best results are obtained with $w = 0.55$, which we adopt to compute the final results.
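The weight sweep quoted under Experiment Setup can be illustrated with a minimal sketch. This is not the authors' code: the function names, the nearest-rank percentile rule, the strictly-greater flagging rule, and the synthetic Gaussian scores are all assumptions made for illustration, so the selected weight will generally differ from the 0.55 reported in the paper.

```python
import math
import random


def percentile_threshold(scores, q=0.99):
    """Nearest-rank percentile: smallest observed score with at least a
    fraction q of the scores at or below it (an assumed convention)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def hybrid_score(w, s_action, s_cot):
    """Weighted combination s_hybrid = w * s_action + (1 - w) * s_cot."""
    return w * s_action + (1.0 - w) * s_cot


def sweep_weights(benign, attack, n_steps=100, q=0.99):
    """Try w = 0, 1/n_steps, ..., 1; return (best_w, detection_rate, threshold).

    benign and attack are lists of (s_action, s_cot) pairs. A sample is
    flagged when its hybrid score is strictly above the benign threshold,
    which keeps the benign false-positive rate at or below 1 - q.
    Ties in detection rate keep the smallest w.
    """
    best = None
    for i in range(n_steps + 1):
        w = i / n_steps
        threshold = percentile_threshold(
            [hybrid_score(w, a, c) for a, c in benign], q
        )
        caught = sum(hybrid_score(w, a, c) > threshold for a, c in attack)
        rate = caught / len(attack)
        if best is None or rate > best[1]:
            best = (w, rate, threshold)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic suspicion scores, purely illustrative (not data from the paper).
    benign = [(rng.gauss(20, 10), rng.gauss(20, 10)) for _ in range(1000)]
    attack = [(rng.gauss(45, 15), rng.gauss(50, 15)) for _ in range(500)]
    w, rate, threshold = sweep_weights(benign, attack)
    print(f"best w={w:.2f}, detection rate={rate:.3f}, threshold={threshold:.2f}")
```

Computing the threshold from benign samples alone, separately for every weight, is what keeps the comparison across weights fair: each candidate monitor is evaluated at the same false-positive budget before detection rates are compared.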