On scalable oversight with weak LLMs judging strong LLMs

Authors: Zachary Kenton, Noah Siegel, Janos Kramar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah Goodman, Rohin Shah

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We perform a large-scale evaluation sweeping over 9 tasks, each sampling 128 questions, totalling approximately 5 million model generation calls, affording us insight on which aspects of our study are practically significant.
Researcher Affiliation | Industry | All authors are affiliated with Google DeepMind.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We do not include our code but could aim to at a later date.
Open Datasets | Yes | QuALITY [32], BoolQ [11], GPQA-extractive [39], MMLU [20], GSM8KQA [12], PrOntoQA [41], TruthfulQA [29], GPQA [39], MMMU [48]
Dataset Splits | No | The paper mentions sweeping over 9 tasks, each sampling 128 questions for evaluation, but does not specify explicit training, validation, and test dataset splits with percentages or sample counts for its own experimental setup.
Hardware Specification | No | This is an evaluation paper which requires access to server APIs rather than our own compute, so we don't report the compute resources used.
Software Dependencies | No | The paper mentions using 'Scipy's permutation_test' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | For these results we select the following settings: for consultancy/debate, we use Pro 1.5 as consultant/debaters and have 3 rounds of interaction. For debate, we use simultaneous turns with debaters selecting their responses through Best-of-4: 4 samples are independently generated, and Pro 1.5 is prompted to select the most persuasive one (more details in Appendix F). Judges are 0-shot prompted to predict the answer given the protocol transcript. Models are used 1-shot, with default sampling options unless otherwise specified. Our prompts are adapted from Khan et al. [25] with a few modifications: changed 'quote' to 'passage'...
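
The Experiment Setup row describes the debate protocol only in prose. The sketch below is a rough illustration of how 3 rounds of simultaneous turns with Best-of-4 debater self-selection and a final 0-shot judge call could fit together; `debater_generate`, `judge_generate`, and every prompt string are hypothetical stand-ins, not the authors' implementation or their prompts (which are adapted from Khan et al. [25]).

```python
# Illustrative sketch only: hypothetical wrappers around the debater and judge
# model APIs (Gemini Pro 1.5 debaters and a weaker judge in the paper).
from typing import Callable

GenerateFn = Callable[[str], str]

def best_of_n(generate: GenerateFn, prompt: str, n: int = 4) -> str:
    """Sample n candidate arguments independently and ask the same model to
    select the most persuasive one (Best-of-4 in the paper)."""
    candidates = [generate(prompt) for _ in range(n)]
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    choice = generate(
        "Pick the most persuasive of the candidate arguments below. "
        "Answer with its number only.\n\n" + numbered
    )
    for i in range(n):
        if str(i + 1) in choice:
            return candidates[i]
    return candidates[0]  # fall back if the selection reply cannot be parsed

def run_debate(
    debater_generate: GenerateFn,
    judge_generate: GenerateFn,
    question: str,
    answer_a: str,
    answer_b: str,
    rounds: int = 3,
) -> str:
    """Run a simultaneous-turn debate and return the judge's verdict."""
    transcript = f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
    for r in range(1, rounds + 1):
        turn_a = best_of_n(
            debater_generate, f"{transcript}\nRound {r}: argue that Answer A is correct."
        )
        turn_b = best_of_n(
            debater_generate, f"{transcript}\nRound {r}: argue that Answer B is correct."
        )
        # Simultaneous turns: append both arguments only after both are sampled,
        # so neither debater sees the other's current-round argument.
        transcript += f"\nRound {r} debater A: {turn_a}\nRound {r} debater B: {turn_b}\n"
    # 0-shot judge: predict the answer given only the protocol transcript.
    return judge_generate(transcript + "\nWhich answer is correct? Reply with A or B.")
```

The simultaneous-turn detail matters here: neither debater's round-r argument conditions on the other's round-r argument, which the sketch enforces by appending both turns to the transcript only after both have been sampled.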
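
The Software Dependencies row notes that the paper relies on SciPy's permutation_test without pinning a version. Below is a minimal sketch of how such a significance test could be set up for paired per-question judge accuracy under two protocols; the placeholder data, the choice of statistic, and the paired permutation_type are assumptions for illustration, not the authors' analysis.

```python
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(0)

# Placeholder 0/1 judge correctness on the same 128 questions under two
# protocols (128 matches the per-task sample size reported in the paper;
# the values themselves are random stand-ins, not results).
debate_correct = rng.integers(0, 2, size=128)
consultancy_correct = rng.integers(0, 2, size=128)

def accuracy_difference(x, y, axis=-1):
    # Difference in mean judge accuracy between the two protocols.
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

res = permutation_test(
    (debate_correct, consultancy_correct),
    accuracy_difference,
    permutation_type="samples",  # paired: permute protocol labels per question
    vectorized=True,
    n_resamples=10_000,
    alternative="two-sided",
)
print(f"accuracy difference = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")
```

Whether the paper uses a paired or an independent permutation test is not stated in this excerpt, so the permutation_type above is a guess rather than a documented choice.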