Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Authors: Qi Chen, Bowen Zhang, Gang Wang, Qi Wu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements.
Researcher Affiliation | Collaboration | Qi Chen, Bowen Zhang, Gang Wang, Qi Wu; Australian Institute for Machine Learning, University of Adelaide; {qi.chen04, b.zhang, qi.wu01}@adelaide.edu.au, gang@theb.ai
Pseudocode | No | The paper describes the interaction dynamics and character initialisation of the player-judge framework in prose and figures, but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/chenqi008/LateralThinking.
Open Datasets | Yes | In this context, we introduce SPLAT, a benchmark that leverages Situation Puzzles for evaluating and eliciting LAteral Thinking of LLMs, which contains 975 high-quality situation puzzle pairs. ... Upon publication, we will release our data, as all the puzzles are sourced from publicly accessible websites.
Dataset Splits | No | The paper evaluates LLMs on the SPLAT benchmark, which serves as the test set, but does not specify training/validation/test splits, since the LLMs are pre-trained and evaluated in a zero-shot setting.
Hardware Specification | No | As we use the APIs for all the LLMs employed in this work, any standard personal computer can handle it.
Software Dependencies | No | The paper mentions specific LLMs and models by name (e.g., Llama3, WizardLM-2) and refers to libraries such as an AutoCorrect library and Sentence-BERT, but does not provide version numbers for the ancillary software dependencies required for replication.
Experiment Setup | Yes | We set R = 1 and the max round is 15. In Table 4, GPT-4 and its Turbo variant, along with Llama3 (70B-Instruct) and WizardLM-2 (8x22B), show robust capabilities, achieving relatively higher accuracy. ... To effectively integrate our benchmark's data and reasoning processes from the proposed player-judge framework, we treat the question-answer pairs generated during the benchmark as auxiliary prompts. These are seamlessly incorporated into the LLMs' reasoning processes to enhance their thinking ability.
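
The Experiment Setup row describes a multi-round player-judge interaction capped at 15 rounds, whose question-answer pairs are later reused as auxiliary prompts on other benchmarks. The sketch below illustrates that loop; the callables player_ask, judge_answer, and judge_verdict, plus build_auxiliary_prompt, are hypothetical stand-ins for LLM API calls and are not taken from the released code, and the R = 1 setting is not modelled here.

```python
# Minimal sketch of the player-judge loop described under Experiment Setup.
# Only the 15-round limit comes from the report; all helper callables are
# hypothetical wrappers around LLM API calls.

MAX_ROUNDS = 15

def run_puzzle(puzzle, solution, player_ask, judge_answer, judge_verdict):
    """Play one situation puzzle; return the Q/A transcript and a success flag."""
    qa_pairs = []  # question-answer pairs later reused as auxiliary prompts
    for _ in range(MAX_ROUNDS):
        question = player_ask(puzzle, qa_pairs)            # player LLM asks a yes/no question
        answer = judge_answer(puzzle, solution, question)  # judge LLM replies yes / no / irrelevant
        qa_pairs.append((question, answer))
        if judge_verdict(solution, qa_pairs):              # judge decides the scenario is solved
            return qa_pairs, True
    return qa_pairs, False

def build_auxiliary_prompt(qa_pairs, target_question):
    """Fold the collected Q/A pairs into a prompt for another benchmark item."""
    history = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return f"{history}\n\nUsing similar lateral reasoning, answer:\n{target_question}"
```

build_auxiliary_prompt corresponds to the transfer step quoted above, where the generated question-answer pairs are incorporated into the LLMs' reasoning on benchmarks such as RiddleSense and BrainTeaser.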
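
The Software Dependencies row names Sentence-BERT without pinning a version. The following sketch shows one plausible way a judge could score a player's final scenario reconstruction against the reference using the sentence-transformers package; the checkpoint name and the 0.7 threshold are assumptions, not details from the paper.

```python
# Hedged sketch of a Sentence-BERT similarity check between the reference
# scenario and a player's reconstruction. Checkpoint and threshold are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint would do

reference = "The man was a lighthouse keeper who turned off the light."
prediction = "He switched off the lighthouse lamp, causing the ships to crash."

emb_ref, emb_pred = model.encode([reference, prediction], convert_to_tensor=True)
similarity = util.cos_sim(emb_ref, emb_pred).item()
print(f"cosine similarity = {similarity:.3f}")
if similarity > 0.7:  # illustrative threshold only
    print("scenario judged as correctly reconstructed")
```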
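
The Research Type row reports over 80% agreement between the WizardLM-2 judge and human annotators. A minimal sketch of how such an agreement rate could be computed over per-question verdicts, assuming simple label matching (the exact protocol is not restated in this report):

```python
# Hypothetical agreement computation between LLM-judge and human verdicts.
# The verdict labels ("yes"/"no"/"irrelevant") are assumptions for illustration.

def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of items on which the LLM judge and the human annotator agree."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# Example: five intermediate question-answering verdicts
judge = ["yes", "no", "yes", "irrelevant", "yes"]
human = ["yes", "no", "no", "irrelevant", "yes"]
print(f"agreement = {agreement_rate(judge, human):.0%}")  # agreement = 80%
```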