Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Authors: Kaijing Ma, Xeron Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct an ablation study on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments.
Researcher Affiliation | Collaboration | (1) Multimodal Art Projection Research Community, (2) ByteDance Inc., (3) 01.AI, (4) 2077.AI, (5) Tongji University, (6) École Polytechnique, (7) University of Illinois at Urbana-Champaign, (8) University of Manchester, (9) Nanjing University, (10) Carnegie Mellon University
Pseudocode | No | The paper describes the data construction process and various reasoning tasks in natural language. There are prompt templates in Appendix D, but no sections explicitly labeled 'Pseudocode' or 'Algorithm', nor structured code-like blocks for an algorithm.
Open Source Code | No | Code: We have developed and will release a comprehensive codebase that includes: scripts for data loading; implementation of all evaluation metrics; code for running experiments.
Open Datasets | No | Dataset: The complete KOR-Bench dataset, including all rules, questions, and answers, will be made publicly available upon publication.
Dataset Splits | No | The paper evaluates pre-trained models on the KOR-Bench dataset. While it describes an ablation study on dataset size (subsets and proportions) and a three-shot prompting strategy that uses three Q&A pairs for in-context learning, it does not define traditional training/validation/test splits for KOR-Bench itself; the benchmark serves entirely as the test set for the evaluations presented.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments. It lists the models that were evaluated but not the computational resources used to run them.
Software Dependencies | No | Specifically, for mathematical expressions, SymPy (Meurer et al., 2017) is used for parsing in LaTeX format and simplifying the expressions for comparison.
Experiment Setup | Yes | Prompting Strategy. The zero-shot prompting strategy for chat models generates responses based on newly defined rules and questions, as outlined in the prompt template in Appendix D. Base models use a three-shot strategy, providing three generic Q&A pairs for each rule to support in-context learning. Evaluation Methodology. We parse the output with a regular expression to match the contents of the double square brackets... Comprehensive details regarding extraction and evaluation can be found in Appendix C.4.
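The double-bracket extraction described under Experiment Setup can be sketched with a short regular expression. The pattern and helper below are illustrative assumptions, not the paper's released evaluation code; the exact procedure is in its Appendix C.4.

```python
import re

# Hypothetical sketch: KOR-Bench asks models to wrap the final answer in
# double square brackets, and evaluation matches that span with a regex.
# The non-greedy pattern and the "take the last match" choice are assumptions.
ANSWER_PATTERN = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def extract_answer(response: str):
    """Return the content of the last [[...]] span in a model response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None
```

Taking the last match is one plausible convention for responses that show intermediate bracketed work before the final answer; the actual benchmark code may resolve such cases differently.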
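The SymPy-based comparison noted under Software Dependencies can be approximated as follows. This is a minimal sketch assuming plain SymPy-parsable strings rather than the paper's LaTeX parsing step, and the function name is illustrative, not from the authors' codebase.

```python
import sympy

def expressions_match(predicted: str, reference: str) -> bool:
    """Check symbolic equivalence by simplifying the difference to zero.

    A hedged approximation of the paper's SymPy-based comparison; the
    original parses LaTeX-formatted output first, which this sketch skips.
    """
    try:
        diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(reference))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        # If symbolic parsing fails, fall back to exact string comparison.
        return predicted.strip() == reference.strip()
```

Simplifying the difference, rather than comparing simplified forms directly, avoids false negatives when two algebraically equal expressions simplify to different canonical shapes.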