MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
Authors: Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we present a process-based benchmark MR-Ben that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). |
| Researcher Affiliation | Academia | 1Chinese University of Hong Kong 2University of Cambridge 3University of Edinburgh 4City University of Hong Kong 5Tsinghua University 6University of Texas at Austin 7University of Hong Kong 8Nanyang Technological University 9Massachusetts Institute of Technology |
| Pseudocode | No | The paper includes code snippets as examples of solutions or errors, but not as structured pseudocode or algorithm blocks describing the paper's own methodology. |
| Open Source Code | Yes | Our dataset and codes are available on https://randolph-zeng.github.io/Mr-Ben.github.io. |
| Open Datasets | Yes | To ensure this breadth, we curated questions from various subjects, including natural sciences (mathematics, biology, physics), coding, and logic. Specifically, we sampled questions from mathematics, physics, biology, chemistry, and medicine from MMLU [27], which comprehensively assesses LLMs across academic and professional domains. For logic questions, we draw from LogiQA [40], which encompasses a broad spectrum of logical reasoning types, including categorical, conditional, disjunctive, and conjunctive reasoning. Finally, we select coding problems from MHPP [17] |
| Dataset Splits | No | The paper focuses on evaluating LLMs on their prepared benchmark (MR-Ben). While it mentions a 'hold-out set' for annotator agreement and few-shot examples for in-context learning, it does not specify explicit train/validation/test splits of the MR-Ben dataset for training or validating a model developed by the authors. |
| Hardware Specification | Yes | For local inference, we are using A800 machines with 8 GPUs to run the inferences. |
| Software Dependencies | No | The paper mentions specific LLMs used to generate data (e.g., GPT-3.5-Turbo-0125, Claude2, Mistral-Medium) and a fast inference library ('vllm'), but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch) or for 'vllm'. A minimal vLLM inference sketch is shown after the table. |
| Experiment Setup | Yes | Consequently, we employed a step-wise chain-of-thought prompting technique similar to those described in [77, 64]. This approach guides models in systematically reasoning through solution traces before making final decisions, as detailed in Appendix-D. Considering the complexity of the task, which includes question comprehension, reasoning through the provided solutions, and adhering to format constraints, few-shot demonstration setups are also explored to investigate if models can benefit from In-Context Learning (ICL) examples. Due to the context token limits, we report zero and one-shot results in the main result table (Table 2). A sketch of such a step-wise prompting setup follows the table. |
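The hardware and software rows above report local inference on an 8-GPU A800 node using the 'vllm' library, without pinning versions. The following is a minimal sketch of what that setup could look like with vLLM's standard offline API; the model name, sampling settings, and prompt placeholder are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of local tensor-parallel inference with vLLM on an 8-GPU node
# (e.g., one A800 machine), as described in the Hardware Specification row.
# The model name and sampling settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

# Shard the model across 8 GPUs via tensor parallelism.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=8)

# Greedy decoding keeps the evaluation deterministic (an assumption, not the
# paper's stated configuration).
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = ["<meta-reasoning evaluation prompt goes here>"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```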
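The Experiment Setup row describes a step-wise chain-of-thought prompt that asks the model to reason through a provided solution trace and then judge it, evaluated in zero-shot and one-shot settings. The authors' exact wording is given in Appendix D of the paper; the template and field names below are a hypothetical reconstruction, not the paper's prompt.

```python
# Hypothetical sketch of a step-wise chain-of-thought meta-reasoning prompt in
# the spirit of the Experiment Setup row. The template wording and field names
# are assumptions; the authors' actual prompt is in Appendix D of the paper.
from typing import Optional

TEMPLATE = (
    "Question:\n{question}\n\n"
    "Candidate solution (numbered steps):\n{solution_steps}\n\n"
    "Reason through the solution step by step. Then answer:\n"
    "1. Is the solution correct? (yes/no)\n"
    "2. If not, which step number contains the first error?\n"
    "3. Briefly explain the error and how it should be fixed.\n"
)

def build_prompt(question: str, solution_steps: str,
                 demonstration: Optional[str] = None) -> str:
    """Build a zero-shot prompt, or a one-shot prompt when a worked
    demonstration (question plus annotated solution) is supplied."""
    body = TEMPLATE.format(question=question, solution_steps=solution_steps)
    if demonstration is None:
        return body                      # zero-shot setting
    return demonstration + "\n\n" + body  # one-shot ICL setting
```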