Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

IneqSearch: Hybrid Reasoning for Olympiad Inequality Proofs

Authors: Zhaoqun Li, Beishui Liao, Qiwei Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In empirical evaluation on 437 Olympiad-level inequalities, Ineq Search successfully proves 342 problems, significantly outperforming existing methods and demonstrating the effectiveness of integrating symbolic and neural approaches for mathematical reasoning.
Researcher Affiliation	Collaboration	Zhaoqun Li Zhejiang University Beijing Academy of Artificial Intelligence EMAIL Beishui Liao Zhejiang University EMAIL Qiwei Ye Beijing Academy of Artificial Intelligence EMAIL
Pseudocode	No	The paper describes methods in prose and provides a prompt template, but does not contain structured pseudocode or algorithm blocks.
Open Source Code	No	The data is provided. The code is not yet ready for publication.
Open Datasets	Yes	We constructed INEQ-437, a benchmark of 437 Olympiad-level inequalities compiled from authoritative sources [14, 13]. We included all problems expressible in the form of Equation 1 without manual selection, representing approximately 90% of inequalities in these sources. The detailed content of the benchmark can be found in the supplementary materials. [...] On the challenging MO-INT dataset [1], our method achieves 100% success rate, demonstrating its effectiveness for particularly difficult cases.
Dataset Splits	No	The paper uses the INEQ-437 and MO-INT-20 benchmarks for evaluation but does not specify any training, validation, or test dataset splits. The 'three-shot setting' mentioned refers to the LLM evaluation strategy, not data partitioning.
Hardware Specification	No	No GPU was used in the experiments. All computations can be executed on a standard personal computer.
Software Dependencies	No	The paper mentions 'Sym Py' and 'Lean4' with 'Mathlib' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup	Yes	The temperature hyperparameter is set to 0.3 for all language models when applicable.