Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
Authors: Qi Chen, Bowen Zhang, Gang Wang, Qi Wu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. |
| Researcher Affiliation | Collaboration | Qi Chen, Bowen Zhang, Gang Wang, Qi Wu; Australian Institute for Machine Learning, University of Adelaide; {qi.chen04, b.zhang, qi.wu01}@adelaide.edu.au, gang@theb.ai |
| Pseudocode | No | The paper describes the interaction dynamics and character initialisation for the player-judge framework in prose and figures, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/chenqi008/LateralThinking. |
| Open Datasets | Yes | In this context, we introduce SPLAT, a benchmark that leverages Situation Puzzles for evaluating and eliciting LAteral Thinking of LLMs, which contains 975 high-quality situation puzzle pairs. ... Upon publication, we will release our data, as all the puzzles are sourced from publicly accessible websites. |
| Dataset Splits | No | The paper evaluates LLMs on the SPLAT benchmark, which serves as the test set, but does not specify training/validation/test splits for training models on this dataset, as the LLMs are pre-trained and evaluated in a zero-shot setting. |
| Hardware Specification | No | As we use the APIs for all the LLMs employed in this work, any standard personal computer can handle it. |
| Software Dependencies | No | The paper mentions specific LLMs and models by name (e.g., Llama3, WizardLM-2) and refers to libraries like the 'Auto Correct library' and 'Sentence-BERT', but does not provide specific version numbers for the ancillary software dependencies required for replication. |
| Experiment Setup | Yes | We set R = 1 and the max round is 15. In Table 4, GPT-4 and its Turbo variant, along with Llama3 (70B-Instruct) and WizardLM-2 (8x22B), show robust capabilities, achieving relatively higher accuracy. ... To effectively integrate our benchmark's data and reasoning processes from the proposed player-judge framework, we treat the question-answer pairs generated during the benchmark as auxiliary prompts. These are seamlessly incorporated into the LLMs' reasoning processes to enhance their thinking ability. |
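
The Pseudocode and Experiment Setup rows describe a player-judge framework in which a player LLM asks yes/no questions about the puzzle for up to 15 rounds and a judge LLM answers from the hidden scenario before the player makes a final guess. The sketch below is an illustrative reconstruction of that loop, not the authors' released code: the prompts, the `ask_player`/`ask_judge` callables, and the fixed-round stopping rule are assumptions made for illustration.

```python
# Illustrative sketch of the player-judge interaction described in the paper's
# Experiment Setup (max 15 rounds of yes/no questions, then a final guess).
# NOT the authors' code: prompts, callables, and stopping rule are assumptions.
from typing import Callable, List, Tuple

MAX_ROUNDS = 15  # "the max round is 15" (Experiment Setup row)

def play_puzzle(
    surface: str,                      # the puzzle text shown to the player
    hidden_scenario: str,              # ground-truth scenario, seen only by the judge
    ask_player: Callable[[str], str],  # wraps the player LLM (e.g. GPT-4, Llama3)
    ask_judge: Callable[[str], str],   # wraps the judge LLM (e.g. WizardLM-2)
) -> str:
    """Run one situation puzzle and return the player's final reconstruction."""
    history: List[Tuple[str, str]] = []
    for _ in range(MAX_ROUNDS):
        question = ask_player(
            f"Puzzle: {surface}\nPrevious Q&A: {history}\n"
            "Ask one yes/no question that helps uncover the hidden scenario."
        )
        answer = ask_judge(
            f"Hidden scenario: {hidden_scenario}\nQuestion: {question}\n"
            "Answer only 'Yes', 'No', or 'Irrelevant'."
        )
        history.append((question, answer))

    # The accumulated Q&A pairs double as the "auxiliary prompts" mentioned in
    # the Experiment Setup row when transferring to other benchmarks.
    return ask_player(
        f"Puzzle: {surface}\nQ&A so far: {history}\n"
        "Now state the full hidden scenario in one or two sentences."
    )
```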
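
The Software Dependencies row notes that Sentence-BERT is used without a pinned version. A minimal way to reproduce a semantic-similarity check between a player's final guess and the reference scenario is shown below with the `sentence-transformers` package; the specific checkpoint and any match threshold are assumptions, not values taken from the paper.

```python
# Minimal sketch of a Sentence-BERT similarity check between a predicted and a
# reference scenario. The checkpoint name and threshold are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT checkpoint works

def scenario_similarity(prediction: str, reference: str) -> float:
    """Cosine similarity between embeddings of the two scenario texts."""
    emb = model.encode([prediction, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

if __name__ == "__main__":
    score = scenario_similarity(
        "The man was a lighthouse keeper who turned off the light.",
        "He worked at a lighthouse and switched the lamp off, causing a shipwreck.",
    )
    print(f"similarity = {score:.3f}")  # e.g. treat scores above an assumed threshold as a match
```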