Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On scalable oversight with weak LLMs judging strong LLMs
Authors: Zachary Kenton, Noah Siegel, Janos Kramar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah Goodman, Rohin Shah
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We perform a large-scale evaluation sweeping over 9 tasks, each sampling 128 questions, totalling approximately 5 million model generation calls, affording us insight on which aspects of our study are practically significant. |
| Researcher Affiliation | Industry | All authors are affiliated with Google DeepMind. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We do not include our code but could aim to at a later date. |
| Open Datasets | Yes | QuALITY [32], BoolQ [11], GPQA-extractive [39], MMLU [20], GSM8KQA [12], PrOntoQA [41], TruthfulQA [29], GPQA [39], MMMU [48] |
| Dataset Splits | No | The paper mentions sweeping over 9 tasks, each sampling 128 questions for evaluation, but does not specify explicit training, validation, and test dataset splits with percentages or sample counts for its own experimental setup. |
| Hardware Specification | No | This is an evaluation paper which requires access to server APIs rather than own compute so we don't report the compute resources used. |
| Software Dependencies | No | The paper mentions using 'Scipy's permutation_test' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | For these results we select the following settings: for consultancy/debate, we use Pro 1.5 as consultant/debaters and have 3 rounds of interaction. For debate, we use simultaneous turns with debaters selecting their responses through Best-of-4: 4 samples are independently generated, and Pro 1.5 is prompted to select the most persuasive one (more details in Appendix F). Judges are 0-shot prompted to predict the answer given the protocol transcript. Models are used 1-shot, with default sampling options unless otherwise specified. Our prompts are adapted from Khan et al. [25] with a few modifications: changed 'quote' to 'passage'... |
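
The Best-of-4 step in the Experiment Setup row (sample several candidate responses independently, then have a model pick the most persuasive one) can be sketched as below. This is a minimal illustration, not the authors' code, which is not released: `best_of_n`, `make_toy_generator`, and `toy_select` are hypothetical names, and the toy generator and selector stand in for the calls to Pro 1.5 that the paper describes.

```python
import random
from typing import Callable, List

# Hypothetical type aliases: a generator maps a prompt to one sampled response,
# and a selector returns the index of the candidate it finds most persuasive.
Generator = Callable[[str], str]
Selector = Callable[[str, List[str]], int]


def best_of_n(generate: Generator, select: Selector, prompt: str, n: int = 4) -> str:
    """Draw n independent samples and return the one the selector picks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    index = select(prompt, candidates)
    if not 0 <= index < len(candidates):
        raise ValueError(f"selector returned invalid index {index}")
    return candidates[index]


def make_toy_generator(seed: int = 0) -> Generator:
    """Stand-in for a model call (assumption): emits responses of random length."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        n_words = rng.randint(3, 12)
        return prompt + ": " + " ".join(["evidence"] * n_words)

    return generate


def toy_select(prompt: str, candidates: List[str]) -> int:
    """Stand-in for the persuasiveness prompt (assumption): prefers the longest candidate."""
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


if __name__ == "__main__":
    generate = make_toy_generator(seed=0)
    print(best_of_n(generate, toy_select, "Argue for answer A", n=4))
```

In an actual replication, `generate` and `select` would wrap API calls to the debater model, with the selection prompt described in the paper's Appendix F; the length-based selector here is only a placeholder so that the sketch runs without network access.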