Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Authors: Jiajun Shi, Jian Yang, Jiaheng Liu, Xingyuan Bu, Jiangjie Chen, Junting Zhou, Kaijing Ma, Zhoufutu Wen, Bingli Wang, Yancheng He, Liang Song, Hualei Zhu, Shilong Li, Xingjian Wang, Wei Zhang, Ruibin Yuan, Yifan Yao, Wenjun Yang, Yunli Wang, Siyuan Fang, Siyu Yuan, Qianyu He, Robert Tang, Yingshui Tan, Wangchunshu Zhou, ZHAO-XIANG ZHANG, Zhoujun Li, Wenhao Huang, Ge Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance.
Researcher Affiliation | Collaboration | Jiajun Shi1,2,3 *, Jian Yang1,2 *, Jiaheng Liu2,4, Xingyuan Bu2, Jiangjie Chen3, Junting Zhou2, Kaijing Ma2,3, Zhoufutu Wen2,3, Bingli Wang2, Yancheng He2, Liang Song2, Hualei Zhu2, Shilong Li2, Xingjian Wang2, Wei Zhang2, Ruibin Yuan2, Yifan Yao2, Wenjun Yang2, Yunli Wang2, Siyuan Fang2, Siyu Yuan4, Qianyu He4, Xiangru Tang2, Yingshui Tan2, Wangchunshu Zhou, Zhaoxiang Zhang5, Zhoujun Li1, Wenhao Huang3, Ge Zhang2,3. 1SKLCCSE, Beihang University; 2M-A-P; 3ByteDance Seed; 4Nanjing University; 5CASIA
Pseudocode | No | The paper describes the system workflow and interaction mechanisms in narrative text under '3.1 Framework' and 'Based on these modules, the primary inference workflow of KORGym proceeds as follows:', but does not present any formal pseudocode blocks or algorithms.
Open Source Code | Yes | Our codebase and experimental results are available at: https://github.com/multimodal-art-projection/KORGym
Open Datasets | Yes | We design a suite of over fifty text- and vision-based games tailored to evaluate the reasoning capabilities of large language models. We present KORGym, an extensible framework supporting incremental development and reinforcement-learning integration.
Dataset Splits | Yes | Single-epoch Games: Each model is evaluated on 50 independently initialized game instances by varying the seed parameter in the generate API from 1 to 50. Multiple-epoch Games: For each model, we initialize 20 game environments. Each episode permits up to 100 interaction rounds, and we vary the seed parameter in the generate API from 1 to 50 for reproducibility.
Hardware Specification | Yes | We evaluate closed-source models via their hosted APIs and open-source models on eight NVIDIA A100-80G GPUs.
Software Dependencies | No | The paper mentions that KORGym is built on 'Gymnasium [2]' but does not specify version numbers for Gymnasium or any other software libraries or dependencies used in their implementation.
Experiment Setup | Yes | All assessments use a zero-shot prompting setup to gauge genuine reasoning capabilities, retaining each model's default sampling parameters (temperature and top-p).
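
The seeded evaluation protocol quoted under Dataset Splits can be illustrated with a minimal sketch. This is not KORGym's code: toy_generate, toy_model, and both evaluate_* helpers are hypothetical stand-ins written for this page, and the toy "game" (guess a seeded integer) only serves to show how fixed seeds make a run repeatable. The seed range for the multiple-epoch case is left to the caller, since the quoted text mentions both 20 environments and seeds 1 to 50.

```python
import random
from typing import Callable, Dict, Iterable, List

SINGLE_EPOCH_SEEDS = range(1, 51)  # seeds 1..50, as quoted for single-epoch games
MAX_ROUNDS = 100                   # round cap quoted for multiple-epoch games

GameState = Dict[str, int]


def toy_generate(seed: int) -> GameState:
    """Hypothetical stand-in for a 'generate' call: build a game instance from a seed."""
    rng = random.Random(seed)
    return {"seed": seed, "target": rng.randint(1, 10)}


def toy_model(state: GameState) -> int:
    """Hypothetical stand-in for a zero-shot model call; it always answers 5."""
    return 5


def evaluate_single_epoch(
    generate: Callable[[int], GameState],
    model: Callable[[GameState], int],
    seeds: Iterable[int],
) -> float:
    """Score one answer per seeded instance and return the mean accuracy."""
    scores = []
    for seed in seeds:
        state = generate(seed)
        scores.append(1.0 if model(state) == state["target"] else 0.0)
    return sum(scores) / len(scores)


def evaluate_multi_epoch(
    generate: Callable[[int], GameState],
    model: Callable[[GameState], int],
    seeds: Iterable[int],
    max_rounds: int = MAX_ROUNDS,
) -> List[int]:
    """Play each seeded environment until solved or the round cap; return rounds used."""
    rounds_used = []
    for seed in seeds:
        state = generate(seed)
        used = max_rounds  # stays at the cap if the environment is never solved
        for round_idx in range(1, max_rounds + 1):
            if model(state) == state["target"]:
                used = round_idx
                break
        rounds_used.append(used)
    return rounds_used


if __name__ == "__main__":
    print(evaluate_single_epoch(toy_generate, toy_model, SINGLE_EPOCH_SEEDS))
    # range(1, 21) is an assumption for this demo: 20 environments, one seed each.
    print(evaluate_multi_epoch(toy_generate, toy_model, range(1, 21)))
```

Because every instance is derived from its seed alone, rerunning either helper with the same seeds reproduces the same instances, which is the property the quoted protocol relies on.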