On scalable oversight with weak LLMs judging strong LLMs

Authors: Zachary Kenton, Noah Siegel, Janos Kramar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah Goodman, Rohin Shah

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We perform a large-scale evaluation sweeping over 9 tasks, each sampling 128 questions, totalling approximately 5 million model generation calls, affording us insight on which aspects of our study are practically significant.
Researcher Affiliation | Industry | All authors are affiliated with Google DeepMind.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We do not include our code but could aim to at a later date.
Open Datasets | Yes | QuALITY [32], BoolQ [11], GPQA-extractive [39], MMLU [20], GSM8KQA [12], PrOntoQA [41], TruthfulQA [29], GPQA [39], MMMU [48]
Dataset Splits | No | The paper mentions sweeping over 9 tasks, each sampling 128 questions for evaluation, but does not specify explicit training, validation, and test dataset splits with percentages or sample counts for its own experimental setup.
Hardware Specification | No | This is an evaluation paper which requires access to server APIs rather than our own compute, so we don't report the compute resources used.
Software Dependencies | No | The paper mentions using 'Scipy's permutation_test' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | For these results we select the following settings: for consultancy/debate, we use Pro 1.5 as consultant/debaters and have 3 rounds of interaction. For debate, we use simultaneous turns with debaters selecting their responses through Best-of-4: 4 samples are independently generated, and Pro 1.5 is prompted to select the most persuasive one (more details in Appendix F). Judges are 0-shot prompted to predict the answer given the protocol transcript. Models are used 1-shot, with default sampling options unless otherwise specified. Our prompts are adapted from Khan et al. [25] with a few modifications: changed 'quote' to 'passage'...
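
The Experiment Setup row describes the debate protocol only in prose. The sketch below is a rough illustration of how 3 rounds of simultaneous turns with Best-of-4 debater self-selection and a final 0-shot judge call could fit together; `debater_generate`, `judge_generate`, and every prompt string are hypothetical stand-ins, not the authors' implementation or their prompts (which are adapted from Khan et al. [25]).

```python
# Illustrative sketch only: hypothetical wrappers around the debater and judge
# model APIs (Gemini Pro 1.5 debaters and a weaker judge in the paper).
from typing import Callable

GenerateFn = Callable[[str], str]

def best_of_n(generate: GenerateFn, prompt: str, n: int = 4) -> str:
    """Sample n candidate arguments independently and ask the same model to
    select the most persuasive one (Best-of-4 in the paper)."""
    candidates = [generate(prompt) for _ in range(n)]
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    choice = generate(
        "Pick the most persuasive of the candidate arguments below. "
        "Answer with its number only.\n\n" + numbered
    )
    for i in range(n):
        if str(i + 1) in choice:
            return candidates[i]
    return candidates[0]  # fall back if the selection reply cannot be parsed

def run_debate(
    debater_generate: GenerateFn,
    judge_generate: GenerateFn,
    question: str,
    answer_a: str,
    answer_b: str,
    rounds: int = 3,
) -> str:
    """Run a simultaneous-turn debate and return the judge's verdict."""
    transcript = f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
    for r in range(1, rounds + 1):
        turn_a = best_of_n(
            debater_generate, f"{transcript}\nRound {r}: argue that Answer A is correct."
        )
        turn_b = best_of_n(
            debater_generate, f"{transcript}\nRound {r}: argue that Answer B is correct."
        )
        # Simultaneous turns: append both arguments only after both are sampled,
        # so neither debater sees the other's current-round argument.
        transcript += f"\nRound {r} debater A: {turn_a}\nRound {r} debater B: {turn_b}\n"
    # 0-shot judge: predict the answer given only the protocol transcript.
    return judge_generate(transcript + "\nWhich answer is correct? Reply with A or B.")
```

The simultaneous-turn detail matters here: neither debater's round-r argument conditions on the other's round-r argument, which the sketch enforces by appending both turns to the transcript only after both have been sampled.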
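
The Software Dependencies row notes that the paper relies on SciPy's permutation_test without pinning a version. Below is a minimal sketch of how such a significance test could be set up for paired per-question judge accuracy under two protocols; the placeholder data, the choice of statistic, and the paired permutation_type are assumptions for illustration, not the authors' analysis.

```python
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(0)

# Placeholder 0/1 judge correctness on the same 128 questions under two
# protocols (128 matches the per-task sample size reported in the paper;
# the values themselves are random stand-ins, not results).
debate_correct = rng.integers(0, 2, size=128)
consultancy_correct = rng.integers(0, 2, size=128)

def accuracy_difference(x, y, axis=-1):
    # Difference in mean judge accuracy between the two protocols.
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

res = permutation_test(
    (debate_correct, consultancy_correct),
    accuracy_difference,
    permutation_type="samples",  # paired: permute protocol labels per question
    vectorized=True,
    n_resamples=10_000,
    alternative="two-sided",
)
print(f"accuracy difference = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")
```

Whether the paper uses a paired or an independent permutation test is not stated in this excerpt, so the permutation_type above is a guess rather than a documented choice.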