Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Autoformalizing Euclidean Geometry
Authors: Logan Murphy, Kaiyu Yang, Jialiang Sun, Zhaoyu Li, Anima Anandkumar, Xujie Si
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with GPT-4 and GPT-4V show the capability and limitations of state-of-the-art LLMs on autoformalizing geometry problems. |
| Researcher Affiliation | Academia | Logan Murphy 1 * Kaiyu Yang 2 * Jialiang Sun 1 Zhaoyu Li 1 Anima Anandkumar 2 Xujie Si 1 1University of Toronto 2Caltech. |
| Pseudocode | Yes | We describe our variant of the formal system E in Lean-like pseudocode, including the complete list of axioms. |
| Open Source Code | Yes | The data and code are available at https://github. com/loganrjmurphy/Lean Euclid. |
| Open Datasets | Yes | We construct Lean Euclid, an autoformalization benchmark consisting of problems from Euclid s Elements and the Uni Geo dataset formalized in the Lean proof assistant. ... Data examples in Lean Euclid are manually formalized into Lean from Euclid s Elements (Heiberg, 2007) and the Uni Geo dataset (Chen et al., 2022). |
| Dataset Splits | Yes | To implement few-shot learning, we randomly selected five propositions from Euclid s Elements and five problems from each category in the Uni Geo dataset, serving as in-context learning examples. We then use E3 to automatically evaluate the results of each round. To see how well E3 correlates with human evaluation, we manually evaluate a sample of formalized theorems from Elements to identify any false negatives/positives. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experiments, only the LLM models used (GPT-4, GPT-4V). |
| Software Dependencies | Yes | Experiments were conducted in January 2024 using gpt-4-1106-preview and gpt-4-1106-vision-preview. |
| Experiment Setup | Yes | To implement few-shot learning, we randomly selected five propositions from Euclid s Elements and five problems from each category in the Uni Geo dataset, serving as in-context learning examples. ... Our prompt template contains an overview of Lean Euclid syntax, some examples of wellformed formulas, and various guidelines for the task. For few-shot learning, we also include k examples of informal-formal pairs as in-context demonstrations. The complete prompt template is in Appendix D. |