Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Authors: Kaijing Ma, Xeron Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct an ablation study on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments.
Researcher Affiliation | Collaboration | (1) Multimodal Art Projection Research Community, (2) ByteDance Inc., (3) 01.AI, (4) 2077.AI, (5) Tongji University, (6) École Polytechnique, (7) University of Illinois at Urbana-Champaign, (8) University of Manchester, (9) Nanjing University, (10) Carnegie Mellon University
Pseudocode | No | The paper describes the data construction process and various reasoning tasks in natural language. There are prompt templates in Appendix D, but no sections explicitly labeled 'Pseudocode' or 'Algorithm', nor structured code-like blocks for an algorithm.
Open Source Code | No | Code: We have developed and will release a comprehensive codebase that includes: scripts for data loading; implementation of all evaluation metrics; code for running experiments.
Open Datasets | No | Dataset: The complete KOR-Bench dataset, including all rules, questions, and answers, will be made publicly available upon publication.
Dataset Splits | No | The paper evaluates pre-trained models on the KOR-Bench dataset. While it describes an ablation study on dataset size (subsets and proportions) and a three-shot prompting strategy that uses three Q&A pairs for in-context learning, it does not define traditional training/validation/test splits for KOR-Bench itself; the benchmark serves entirely as the test set for the evaluations presented.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for the experiments. It lists the models that were evaluated but not the computational resources used to run them.
Software Dependencies | No | Specifically, for mathematical expressions, SymPy (Meurer et al., 2017) is used for parsing in LaTeX format and simplifying the expressions for comparison.
Experiment Setup | Yes | Prompting Strategy. The zero-shot prompting strategy for chat models generates responses based on newly defined rules and questions, as outlined in the prompt template in Appendix D. Base models use a three-shot strategy, providing three generic Q&A pairs for each rule to support in-context learning. Evaluation Methodology. We parse the output with a regular expression to match the contents of the double square brackets... Comprehensive details regarding extraction and evaluation can be found in Appendix C.4.
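The double-bracket extraction described under Experiment Setup can be sketched with a short regular expression. The pattern and helper below are illustrative assumptions, not the paper's released evaluation code; the exact procedure is in its Appendix C.4.

```python
import re

# Hypothetical sketch: KOR-Bench asks models to wrap the final answer in
# double square brackets, and evaluation matches that span with a regex.
# The non-greedy pattern and the "take the last match" choice are assumptions.
ANSWER_PATTERN = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def extract_answer(response: str):
    """Return the content of the last [[...]] span in a model response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None
```

Taking the last match is one plausible convention for responses that show intermediate bracketed work before the final answer; the actual benchmark code may resolve such cases differently.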
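The SymPy-based comparison noted under Software Dependencies can be approximated as follows. This is a minimal sketch assuming plain SymPy-parsable strings rather than the paper's LaTeX parsing step, and the function name is illustrative, not from the authors' codebase.

```python
import sympy

def expressions_match(predicted: str, reference: str) -> bool:
    """Check symbolic equivalence by simplifying the difference to zero.

    A hedged approximation of the paper's SymPy-based comparison; the
    original parses LaTeX-formatted output first, which this sketch skips.
    """
    try:
        diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(reference))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        # If symbolic parsing fails, fall back to exact string comparison.
        return predicted.strip() == reference.strip()
```

Simplifying the difference, rather than comparing simplified forms directly, avoids false negatives when two algebraically equal expressions simplify to different canonical shapes.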