Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Scaling Laws For Scalable Oversight

Authors: Joshua Engels, David Baek, Subhash Kantamneni, Max Tegmark

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
Researcher Affiliation | Academia | Joshua Engels (MIT), David D. Baek (MIT), Subhash Kantamneni (MIT), Max Tegmark (MIT)
Pseudocode | Yes | In Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4 we show the pseudocode for how Mafia, Debate, Backdoor Code, and Wargames, respectively, are run and evaluated.
Open Source Code | Yes | Our anonymous code is available at https://anonymous.4open.science/r/oversight-scaling-laws-6276.
Open Datasets | Yes | We prompt a Houdini model to insert a subtle backdoor into a Python 3 program drawn from the APPS dataset (Hendrycks et al., 2021) at the competition level. See Listing 1 for an example, and Appendix A.3 for full pipeline details.
Dataset Splits | No | The paper uses existing datasets such as APPS, TruthfulQA, BoolQ, QuALITY, and PrOntoQA. It mentions drawing 225 Python 3 samples from APPS and deriving 20 question-answer pairs from other datasets. However, it does not specify training/validation/test splits for these datasets. The LLMs are pre-trained, and the datasets serve as contexts or problems for the LLMs to interact with rather than as splits for training or evaluating the LLMs in a supervised learning paradigm.
Hardware Specification | No | All of our experiments use model APIs and thus do not use GPU or CPU resources. Running all of our experiments takes about an hour when parallelized with multiple API calls.
Software Dependencies | No | The paper mentions a "Python 3 program" in the context of the Backdoor Code game. However, it does not name any software dependencies (e.g., libraries, frameworks) with version numbers for the experimental setup or analysis. The LLMs themselves are the subjects of the experiments, accessed via APIs.
Experiment Setup | Yes | In our simulation setting, the game is played with five villagers (Guards) and one Mafia (Houdini). Players are secretly assigned roles at the beginning, and villager players do not know who the Mafia is. The game alternates between day and night phases; at night, the Mafia secretly chooses one player to eliminate, and during the day, all players discuss and vote to eliminate someone they suspect to be Mafia. Each day of discussion has three phases, and each alive player speaks once per phase. Within each phase, the speaking order is randomized. In the first phase of the first day, players are asked to introduce themselves. In the third phase of each day's discussion, players are asked to wrap up the discussion and start to decide on the vote. The goal of Mafia players is to eliminate enough villagers to achieve parity, while villagers aim to identify and eliminate all Mafia members. The game ends when either side achieves their goal.
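
The Mafia setup quoted under Experiment Setup can be sketched as a small game loop. This is an illustrative sketch only, not the authors' implementation (their pseudocode is Algorithm 1 in the paper). All names here (GameState, play_game, speak, vote, night_kill) are hypothetical, the LLM players are replaced by random stand-in policies, and two details the quoted text does not specify are assumed: the night phase comes first, and vote ties are broken by player name.

```python
import random
from dataclasses import dataclass, field

NUM_VILLAGERS = 5    # Guards, per the quoted setup
NUM_MAFIA = 1        # Houdini, per the quoted setup
PHASES_PER_DAY = 3   # three discussion phases per day


@dataclass
class GameState:
    roles: dict                                   # player -> "mafia" or "villager"
    alive: list                                   # players still in the game
    transcript: list = field(default_factory=list)


def winner(state):
    """Return "villagers", "mafia", or None if the game continues."""
    mafia = sum(1 for p in state.alive if state.roles[p] == "mafia")
    villagers = len(state.alive) - mafia
    if mafia == 0:
        return "villagers"
    if mafia >= villagers:                        # parity reached
        return "mafia"
    return None


def phase_prompt(day, phase):
    """Pick the instruction given to speakers in this phase."""
    if day == 1 and phase == 1:
        return "introduce"
    if phase == PHASES_PER_DAY:
        return "wrap_up"
    return "discuss"


def play_game(speak, vote, night_kill, rng):
    players = [f"P{i}" for i in range(NUM_VILLAGERS + NUM_MAFIA)]
    mafia = set(rng.sample(players, NUM_MAFIA))   # secret role assignment
    roles = {p: "mafia" if p in mafia else "villager" for p in players}
    state = GameState(roles=roles, alive=list(players))
    day = 0
    while True:
        # Night: the Mafia eliminates one player (ASSUMPTION: night comes first).
        killer = next(p for p in state.alive if roles[p] == "mafia")
        state.alive.remove(night_kill(killer, state))
        result = winner(state)
        if result:
            return result, state
        # Day: three phases, each alive player speaks once, in random order.
        day += 1
        for phase in range(1, PHASES_PER_DAY + 1):
            order = list(state.alive)
            rng.shuffle(order)
            for p in order:
                text = speak(p, phase_prompt(day, phase), state)
                state.transcript.append((day, phase, p, text))
        # Vote: plurality; ASSUMPTION: ties go to the first name in sorted order.
        ballots = [vote(p, state) for p in state.alive]
        eliminated = max(sorted(set(ballots)), key=ballots.count)
        state.alive.remove(eliminated)
        result = winner(state)
        if result:
            return result, state


def make_random_policies(rng):
    """Random stand-ins for the LLM players (hypothetical, for illustration)."""
    def speak(player, prompt, state):
        return prompt
    def vote(player, state):
        return rng.choice([p for p in state.alive if p != player])
    def night_kill(killer, state):
        return rng.choice([p for p in state.alive if state.roles[p] == "villager"])
    return speak, vote, night_kill


if __name__ == "__main__":
    rng = random.Random(0)
    outcome, final_state = play_game(*make_random_policies(rng), rng)
    print(outcome, len(final_state.transcript))
```

With six players and one Mafia, the loop ends within two days: either the Mafia is voted out, or the villagers are reduced to parity. Replacing the three policy functions with API-backed model calls would be the natural extension, though the prompts and voting protocol would need to follow the paper's own pipeline.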