Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Authors: Benjamin Arnav, Pablo Bernabeu-Perez, Nathan Helm-Burger, Timothy H. Kostolansky, Hannes Whittingham, Mary Phuong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. |
| Researcher Affiliation | Industry | Benjamin Arnav 1 Pablo Bernabeu-Pérez 1 Nathan Helm-Burger 1 Tim Kostolansky 1 Hannes Whittingham 1 Mary Phuong 1 LASR Labs *Core contributors. All correspondence to EMAIL. |
| Pseudocode | No | The paper describes methodologies and evaluation setups but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/cotredhanded/code. |
| Open Datasets | Yes | We use problems from BigCodeBench (Zhuo et al., 2024) as our main task dataset. BigCodeBench is a benchmark for code generation that contains 1,140 function-level tasks requiring a combination of multiple function calls from 139 Python libraries. We use the Instruct variant of the dataset, which presents the coding problems as short natural language instructions. |
| Dataset Splits | Yes | We split the BigCodeBench dataset into four partitions: Prompt, Train, Validation and Test, following 5%/10%/10%/75% splits respectively. |
| Hardware Specification | Yes | The paper reports estimated compute usage in Appendix D, including the number of inference calls, APIs used (Anthropic, Google, Groq, OpenAI, Together), model configurations, and compute requirements for a targeted fine-tuning experiment with DeepSeek R1 Distill Qwen 14B to simulate advanced attack policies. |
| Software Dependencies | No | The paper mentions using the "Inspect AI framework (UK AI Safety Institute, 2024)" and lists various models from providers like Anthropic, DeepMind, OpenAI, DeepSeek-AI, and xAI. However, it does not specify software dependency versions (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x). |
| Experiment Setup | Yes | To optimize the hybrid monitor's performance, we evaluate weighted combinations of action-only and CoT suspicion scores. For each weight $w \in [0, 1]$ at regular intervals of 0.01, hybrid scores were computed as: $s_{\text{hybrid}} = w \cdot s_{\text{action-only}} + (1 - w) \cdot s_{\text{CoT}}$. For each $w$, we determine the 99th percentile threshold using benign examples (main-task-only completions, with no side instruction) to maintain a 1% FPR. We then calculate the detection rate for each weighting using the baseline attack policy. The best results are obtained with $w = 0.55$, which we adopt to compute the final results. |
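
The weight sweep quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code: the function name `sweep_hybrid_weights`, the `SweepResult` container, and the synthetic scores in the demo are all hypothetical, and the sketch assumes suspicion scores are plain numeric arrays with one action-only score and one CoT score per sample.

```python
from typing import NamedTuple

import numpy as np


class SweepResult(NamedTuple):
    """Hypothetical container for the best weighting found by the sweep."""
    weight: float
    detection_rate: float
    threshold: float


def sweep_hybrid_weights(benign_action, benign_cot, attack_action, attack_cot,
                         target_fpr=0.01, n_weights=101):
    """Sweep w over [0, 1] and return the weight with the highest detection rate.

    Assumption: each argument is a 1-D array of suspicion scores, with the
    action-only and CoT arrays aligned sample by sample.
    """
    benign_action = np.asarray(benign_action, dtype=float)
    benign_cot = np.asarray(benign_cot, dtype=float)
    attack_action = np.asarray(attack_action, dtype=float)
    attack_cot = np.asarray(attack_cot, dtype=float)

    best = None
    # 101 evenly spaced weights gives the 0.01 interval described in the table.
    for w in np.linspace(0.0, 1.0, n_weights):
        benign_hybrid = w * benign_action + (1 - w) * benign_cot
        attack_hybrid = w * attack_action + (1 - w) * attack_cot
        # Threshold at the (1 - FPR) quantile of benign scores, i.e. the
        # 99th percentile for a 1% false positive rate.
        threshold = np.quantile(benign_hybrid, 1 - target_fpr)
        # A sample is flagged when its hybrid score exceeds the threshold.
        detection_rate = float(np.mean(attack_hybrid > threshold))
        # Ties keep the smallest weight, since only strict improvements replace it.
        if best is None or detection_rate > best.detection_rate:
            best = SweepResult(float(w), detection_rate, float(threshold))
    return best


if __name__ == "__main__":
    # Synthetic scores for illustration only; they are not data from the paper.
    rng = np.random.default_rng(0)
    result = sweep_hybrid_weights(
        benign_action=rng.normal(10, 5, 500),
        benign_cot=rng.normal(10, 5, 500),
        attack_action=rng.normal(25, 10, 500),
        attack_cot=rng.normal(30, 15, 500),
    )
    print(result)
```

The sketch flags a sample only when its score is strictly above the threshold, which keeps the benign false positive rate at or below the target on the calibration set. Whether the paper uses a strict or inclusive comparison is not stated in the quoted text.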