Scalable AI Safety via Doubly-Efficient Debate

Authors: Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras

ICML 2024

Reproducibility assessment: for each variable, the result followed by the supporting excerpt from the LLM response.
Research Type: Theoretical
  "In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps. We introduce the theoretical model of doubly-efficient debate, where two polynomial-time provers compete with each other in order to convince a much more efficient verifier of the correctness of a computation that depends on access to black-box judgements. In this model we prove that, under appropriate assumptions, any polynomial-time computation can be verified using only a constant number of queries to the black-box representing human judgement (and in time linear in the size of a single query). Our definition of doubly-efficient debate is a complexity-theoretic formalization of a training setup in which two competing AI models attempt to convince a verifier, who has access to human judgements, of the correctness of a solution to a computational problem. In the subsequent sections we prove theorems achieving this high-level goal in several settings."
Researcher Affiliation: Industry
  "Google DeepMind, London, UK. Correspondence to: Jonah Brown-Cohen <jonahbc@google.com>."
Pseudocode: Yes
  "Figure 1. Doubly-efficient debate protocol with cross-examination for time T. Figure 2. Doubly-efficient debate protocol for a stochastic oracle. Figure 3. Doubly-efficient debate protocol with a witness. Figure 4. Doubly-efficient debate protocol for time T and space S."
Open Source Code: Yes
  "Lean 4 formalization. https://github.com/google-deepmind/debate formalizes Theorem 6.2 using Lean 4 and Mathlib (Moura & Ullrich, 2021)."
Open Datasets: No
  "This paper is theoretical and focuses on mathematical proofs and formal models. It does not describe any experiments that were run by the authors and therefore does not provide information about publicly available or open datasets used for training."
Dataset Splits: No
  "This paper is theoretical and focuses on mathematical proofs and formal models. It does not describe any experiments that were run by the authors and therefore does not provide information about dataset splits for training, validation, or testing."
Hardware Specification: No
  "This paper is theoretical and focuses on mathematical proofs and formal models. It does not describe any experiments that were run by the authors and therefore does not specify any hardware used."
Software Dependencies: Yes
  "Lean 4 formalization. https://github.com/google-deepmind/debate formalizes Theorem 6.2 using Lean 4 and Mathlib (Moura & Ullrich, 2021). (The purpose of the formalization was primarily to learn Lean 4.) Our algorithm differs from Figure 2 in a few details; e.g., we use external randomness rather than z_t^A + z_t^B, and the constants are different (if unsure, trust the formalized constants). We formalize the debate protocol using two monads: Prob α for finitely supported probability distributions over a type α, and Comp s α for stochastic computations that can make queries to any oracle in a set s."
Experiment Setup: No
  "This paper is theoretical and focuses on mathematical proofs and formal models. It does not describe any experiments that were run by the authors and therefore does not provide details about an experimental setup, hyperparameters, or training settings."
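The Software Dependencies excerpt mentions two monads, Prob and Comp, used in the Lean 4 formalization. The following is a schematic sketch of what such definitions might look like; the actual definitions in https://github.com/google-deepmind/debate differ in representation and detail, and every name below is illustrative.

```lean
import Mathlib.Data.Rat.Defs
import Mathlib.Data.Set.Basic

-- Schematic sketch only; the repository's actual definitions differ.

/-- `Prob α`: a finitely supported probability distribution over `α`,
represented here as a list of outcome/weight pairs. -/
structure Prob (α : Type) where
  support : List (α × ℚ)

/-- `Comp s α`: a stochastic computation returning `α` that may sample
from distributions and query any oracle whose name lies in the set `s`. -/
inductive Comp (s : Set String) (α : Type) where
  | pure   : α → Comp s α
  | sample : Prob Bool → (Bool → Comp s α) → Comp s α
  | query  : (o : String) → o ∈ s → List Bool → (Bool → Comp s α) → Comp s α
```

Indexing `Comp` by the permitted oracle set `s` lets the type system track which parties (provers vs. verifier) may consult which oracles, which is the kind of invariant the debate protocols rely on.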
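The core idea quoted above — a cheap verifier checking a long computation with only a constant number of oracle queries, because a second prover can pinpoint the disputed step — can be illustrated with a toy sketch. This is not one of the paper's actual protocols; `step`, the transcript layout, and both prover strategies below are invented for illustration only.

```python
import random

# Toy sketch: prover A claims a T-step transcript of a computation whose
# transition function `step` stands in for the black-box judgement oracle.
# Without debate, the verifier must spot-check many transitions; with an
# honest prover B who points at the first incorrect transition, a single
# oracle query settles the dispute.

def step(x):
    """Black-box next-state function (stand-in for the oracle)."""
    return (3 * x + 1) % 97

def honest_transcript(x0, T):
    """Prover A, playing honestly: run the computation for T steps."""
    xs = [x0]
    for _ in range(T):
        xs.append(step(xs[-1]))
    return xs

def find_dispute(transcript):
    """Prover B: recompute honestly (polynomial time) and return the
    index of the first incorrect transition, or None to concede."""
    for i in range(len(transcript) - 1):
        if transcript[i + 1] != step(transcript[i]):
            return i
    return None

def verifier(transcript, disputed):
    """Verifier: if B concedes, accept; otherwise re-check the single
    disputed transition with ONE oracle query."""
    if disputed is None:
        return True
    return transcript[disputed + 1] == step(transcript[disputed])

good = honest_transcript(x0=5, T=1000)
assert verifier(good, find_dispute(good))          # honest A wins

bad = list(good)
bad[500] += 1                                      # dishonest A corrupts one step
assert verifier(bad, find_dispute(bad)) is False   # B pinpoints it; A loses
```

Note that without prover B, catching a single corrupted transition among T by random spot-checks would require on the order of T oracle queries; the second prover is what collapses the verifier's oracle cost to a constant.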