MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
Authors: Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we present a process-based benchmark MR-Ben that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). |
| Researcher Affiliation | Academia | 1Chinese University of Hong Kong 2University of Cambridge 3University of Edinburgh 4City University of Hong Kong 5Tsinghua University 6University of Texas at Austin 7University of Hong Kong 8Nanyang Technological University 9Massachusetts Institute of Technology |
| Pseudocode | No | The paper includes code snippets as examples of solutions or errors, but not as structured pseudocode or algorithm blocks describing the paper's own methodology. |
| Open Source Code | Yes | Our dataset and codes are available on https://randolph-zeng.github.io/Mr-Ben.github.io. |
| Open Datasets | Yes | To ensure this breadth, we curated questions from various subjects, including natural sciences (mathematics, biology, physics), coding, and logic. Specifically, we sampled questions from mathematics, physics, biology, chemistry, and medicine from MMLU [27], which comprehensively assesses LLMs across academic and professional domains. For logic questions, we draw from LogiQA [40], which encompasses a broad spectrum of logical reasoning types, including categorical, conditional, disjunctive, and conjunctive reasoning. Finally, we select coding problems from MHPP [17] |
| Dataset Splits | No | The paper focuses on evaluating LLMs on their prepared benchmark (MR-Ben). While it mentions a 'hold-out set' for annotator agreement and few-shot examples for in-context learning, it does not specify explicit train/validation/test splits of the MR-Ben dataset for training or validating a model developed by the authors. |
| Hardware Specification | Yes | For local inference, we are using A800 machines with 8 GPUs to run the inferences. |
| Software Dependencies | No | The paper mentions specific LLMs used to generate data (e.g., GPT-3.5-Turbo-0125, Claude2, Mistral-Medium) and a fast inference library ('vllm'), but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch) or for 'vllm'. A minimal vLLM inference sketch is shown after the table. |
| Experiment Setup | Yes | Consequently, we employed a step-wise chain-of-thought prompting technique similar to those described in [77, 64]. This approach guides models in systematically reasoning through solution traces before making final decisions, as detailed in Appendix-D. Considering the complexity of the task, which includes question comprehension, reasoning through the provided solutions, and adhering to format constraints, few-shot demonstration setups are also explored to investigate if models can benefit from In-Context Learning (ICL) examples. Due to the context token limits, we report zero and one-shot results in the main result table (Table 2). A sketch of such a step-wise prompting setup follows the table. |
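The hardware and software rows above report local inference on an 8-GPU A800 node using the 'vllm' library, without pinning versions. The following is a minimal sketch of what that setup could look like with vLLM's standard offline API; the model name, sampling settings, and prompt placeholder are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of local tensor-parallel inference with vLLM on an 8-GPU node
# (e.g., one A800 machine), as described in the Hardware Specification row.
# The model name and sampling settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

# Shard the model across 8 GPUs via tensor parallelism.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=8)

# Greedy decoding keeps the evaluation deterministic (an assumption, not the
# paper's stated configuration).
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = ["<meta-reasoning evaluation prompt goes here>"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```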
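The Experiment Setup row describes a step-wise chain-of-thought prompt that asks the model to reason through a provided solution trace and then judge it, evaluated in zero-shot and one-shot settings. The authors' exact wording is given in Appendix D of the paper; the template and field names below are a hypothetical reconstruction, not the paper's prompt.

```python
# Hypothetical sketch of a step-wise chain-of-thought meta-reasoning prompt in
# the spirit of the Experiment Setup row. The template wording and field names
# are assumptions; the authors' actual prompt is in Appendix D of the paper.
from typing import Optional

TEMPLATE = (
    "Question:\n{question}\n\n"
    "Candidate solution (numbered steps):\n{solution_steps}\n\n"
    "Reason through the solution step by step. Then answer:\n"
    "1. Is the solution correct? (yes/no)\n"
    "2. If not, which step number contains the first error?\n"
    "3. Briefly explain the error and how it should be fixed.\n"
)

def build_prompt(question: str, solution_steps: str,
                 demonstration: Optional[str] = None) -> str:
    """Build a zero-shot prompt, or a one-shot prompt when a worked
    demonstration (question plus annotated solution) is supplied."""
    body = TEMPLATE.format(question=question, solution_steps=solution_steps)
    if demonstration is None:
        return body                      # zero-shot setting
    return demonstration + "\n\n" + body  # one-shot ICL setting
```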