Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
IneqSearch: Hybrid Reasoning for Olympiad Inequality Proofs
Authors: Zhaoqun Li, Beishui Liao, Qiwei Ye
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In empirical evaluation on 437 Olympiad-level inequalities, Ineq Search successfully proves 342 problems, significantly outperforming existing methods and demonstrating the effectiveness of integrating symbolic and neural approaches for mathematical reasoning. |
| Researcher Affiliation | Collaboration | Zhaoqun Li Zhejiang University Beijing Academy of Artificial Intelligence EMAIL Beishui Liao Zhejiang University EMAIL Qiwei Ye Beijing Academy of Artificial Intelligence EMAIL |
| Pseudocode | No | The paper describes methods in prose and provides a prompt template, but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The data is provided. The code is not yet ready for publication. |
| Open Datasets | Yes | We constructed INEQ-437, a benchmark of 437 Olympiad-level inequalities compiled from authoritative sources [14, 13]. We included all problems expressible in the form of Equation 1 without manual selection, representing approximately 90% of inequalities in these sources. The detailed content of the benchmark can be found in the supplementary materials. [...] On the challenging MO-INT dataset [1], our method achieves 100% success rate, demonstrating its effectiveness for particularly difficult cases. |
| Dataset Splits | No | The paper uses the INEQ-437 and MO-INT-20 benchmarks for evaluation but does not specify any training, validation, or test dataset splits. The 'three-shot setting' mentioned refers to the LLM evaluation strategy, not data partitioning. |
| Hardware Specification | No | No GPU was used in the experiments. All computations can be executed on a standard personal computer. |
| Software Dependencies | No | The paper mentions 'Sym Py' and 'Lean4' with 'Mathlib' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The temperature hyperparameter is set to 0.3 for all language models when applicable. |