Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Autoformalizing Euclidean Geometry

Authors: Logan Murphy, Kaiyu Yang, Jialiang Sun, Zhaoyu Li, Anima Anandkumar, Xujie Si

ICML 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments with GPT-4 and GPT-4V show the capability and limitations of state-of-the-art LLMs on autoformalizing geometry problems.
Researcher Affiliation	Academia	Logan Murphy 1 * Kaiyu Yang 2 * Jialiang Sun 1 Zhaoyu Li 1 Anima Anandkumar 2 Xujie Si 1 1University of Toronto 2Caltech.
Pseudocode	Yes	We describe our variant of the formal system E in Lean-like pseudocode, including the complete list of axioms.
Open Source Code	Yes	The data and code are available at https://github. com/loganrjmurphy/Lean Euclid.
Open Datasets	Yes	We construct Lean Euclid, an autoformalization benchmark consisting of problems from Euclid s Elements and the Uni Geo dataset formalized in the Lean proof assistant. ... Data examples in Lean Euclid are manually formalized into Lean from Euclid s Elements (Heiberg, 2007) and the Uni Geo dataset (Chen et al., 2022).
Dataset Splits	Yes	To implement few-shot learning, we randomly selected five propositions from Euclid s Elements and five problems from each category in the Uni Geo dataset, serving as in-context learning examples. We then use E3 to automatically evaluate the results of each round. To see how well E3 correlates with human evaluation, we manually evaluate a sample of formalized theorems from Elements to identify any false negatives/positives.
Hardware Specification	No	No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experiments, only the LLM models used (GPT-4, GPT-4V).
Software Dependencies	Yes	Experiments were conducted in January 2024 using gpt-4-1106-preview and gpt-4-1106-vision-preview.
Experiment Setup	Yes	To implement few-shot learning, we randomly selected five propositions from Euclid s Elements and five problems from each category in the Uni Geo dataset, serving as in-context learning examples. ... Our prompt template contains an overview of Lean Euclid syntax, some examples of wellformed formulas, and various guidelines for the task. For few-shot learning, we also include k examples of informal-formal pairs as in-context demonstrations. The complete prompt template is in Appendix D.