On scalable oversight with weak LLMs judging strong LLMs
Authors: Zachary Kenton, Noah Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah Goodman, Rohin Shah
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We perform a large-scale evaluation sweeping over 9 tasks, each sampling 128 questions, totalling approximately 5 million model generation calls, affording us insight on which aspects of our study are practically significant. |
| Researcher Affiliation | Industry | All authors are affiliated with Google DeepMind. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We do not include our code but could aim to at a later date. |
| Open Datasets | Yes | QuALITY [32], BoolQ [11], GPQA-extractive [39], MMLU [20], GSM8KQA [12], PrOntoQA [41], TruthfulQA [29], GPQA [39], MMMU [48] |
| Dataset Splits | No | The paper mentions sweeping over 9 tasks, each sampling 128 questions for evaluation, but does not specify explicit training, validation, and test dataset splits with percentages or sample counts for its own experimental setup. |
| Hardware Specification | No | This is an evaluation paper which requires access to server APIs rather than own compute so we don't report the compute resources used. |
| Software Dependencies | No | The paper mentions using 'SciPy's permutation_test' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments (a usage sketch of `permutation_test` follows the table). |
| Experiment Setup | Yes | For these results we select the following settings: for consultancy/debate, we use Pro 1.5 as consultant/debaters and have 3 rounds of interaction. For debate, we use simultaneous turns with debaters selecting their responses through Best-of-4: 4 samples are independently generated, and Pro 1.5 is prompted to select the most persuasive one (more details in Appendix F). Judges are 0-shot prompted to predict the answer given the protocol transcript. Models are used 1-shot, with default sampling options unless otherwise specified. Our prompts are adapted from Khan et al. [25] with a few modifications: changed 'quote' to 'passage'... (a Best-of-N selection sketch follows the table). |
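
To make the protocol concrete, here is a minimal sketch of the debate setup described in the Experiment Setup row: three simultaneous rounds, each debater turn chosen via Best-of-4 self-selection, and a 0-shot judge that sees only the transcript. The `call_model` placeholder, prompt wording, and candidate-selection parsing are illustrative assumptions, not the authors' released code.

```python
from typing import Callable, List

def best_of_n(call_model: Callable[[str], str], prompt: str, n: int = 4) -> str:
    """Sample n candidate arguments independently, then ask the same model
    to pick the most persuasive one (Best-of-4 in the paper)."""
    candidates = [call_model(prompt) for _ in range(n)]
    listing = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    choice = call_model(
        f"{prompt}\n\n{listing}\n\nReply with the number of the most persuasive candidate."
    )
    # Naive parse of the selector's reply; clamp to a valid index.
    idx = int("".join(ch for ch in choice if ch.isdigit()) or "1") - 1
    return candidates[max(0, min(idx, n - 1))]

def run_debate(call_model: Callable[[str], str], question: str,
               answers: List[str], rounds: int = 3) -> str:
    """Simultaneous-turn debate: each debater defends one assigned answer per round."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        turns = []
        for side, answer in enumerate(answers):
            prompt = (f"{transcript}\nYou are debater {side + 1}, arguing that the "
                      f"answer is '{answer}'. Write your round-{r + 1} argument.")
            turns.append(best_of_n(call_model, prompt, n=4))
        # Reveal both turns only after the round, so turns are simultaneous.
        for side, turn in enumerate(turns):
            transcript += f"\nRound {r + 1}, debater {side + 1}: {turn}"
    return transcript

def judge(call_model: Callable[[str], str], transcript: str, answers: List[str]) -> str:
    """0-shot judge: predict the answer given only the protocol transcript."""
    options = ", ".join(answers)
    return call_model(f"{transcript}\n\nWhich answer is correct ({options})? Answer only.")
```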
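
The Software Dependencies row notes only that SciPy's `permutation_test` was used, without a version. Below is a hedged usage sketch with placeholder data (paired per-question judge correctness under two protocols over 128 questions); the pairing choice, test statistic, and resample count are assumptions for illustration, not values reported in the paper.

```python
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(0)
debate_correct = rng.integers(0, 2, size=128)       # placeholder 0/1 judge correctness
consultancy_correct = rng.integers(0, 2, size=128)  # placeholder 0/1 judge correctness

def mean_difference(x, y, axis):
    # Difference in mean judge accuracy between the two protocols.
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

# permutation_type="samples" treats the two arrays as paired (same questions),
# permuting which protocol each paired observation is assigned to.
result = permutation_test(
    (debate_correct, consultancy_correct),
    mean_difference,
    permutation_type="samples",
    vectorized=True,
    n_resamples=10_000,
    alternative="two-sided",
)
print(f"accuracy gap = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```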