Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
Authors: Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang H. Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le, Thang Luong
JMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AG2 achieves a new state-of-the-art solving rate of 84% on all IMO geometry problems from 2000 to 2024, compared to 54% achieved in AG1. This demonstrates a significant leap forward in AI's ability to tackle challenging mathematical reasoning tasks, surpassing an average IMO gold medalist. We also run ablation studies on how inference settings affect the overall performance (see Figure 9). |
| Researcher Affiliation | Collaboration | Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang H. Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le, Thang Luong (emails redacted). Equal contributions. Affiliations: Google DeepMind; University of Cambridge; Georgia Institute of Technology; Brown University; Seoul National University. |
| Pseudocode | Yes | `def prune_points(points: set[Point], check_provable: Callable[[set[Point]], bool]): pruned = set(points); for p in reverse_topological(points): if check_provable(pruned - {p}): pruned = pruned - {p}; return pruned` — Figure 3: Basic greedy algorithm to find a minimal set of points satisfying a monotonic predicate check. |
| Open Source Code | Yes | Code: https://github.com/google-deepmind/alphageometry2. Code for the Python implementation of the symbolic engine (DDAR2), along with multiple examples of proven IMO problems, will be shared in that repository. |
| Open Datasets | Yes | We share 27 IMO problems translated into the AlphaGeometry language along with their diagrams and solutions. They can be found in the file test.py within the provided repository. |
| Dataset Splits | Yes | Apart from a large synthetic training set of around 300 million theorems, we create three evaluation sets: 1. a synthetic problem set with and without auxiliary points, *eval*; 2. a synthetic problem set with only auxiliary points, *eval_aux*; 3. a special set of geometry problems from IMO 2000-2024 that have been solved by AlphaGeometry previously, *imo_eval*. |
| Hardware Specification | Yes | While on average DDAR1 finishes its computations in 1179.57 ± 8.055 seconds, DDAR2 is much faster, finishing in 3.44711 ± 0.05476 seconds. The average running time may vary depending on the machine status at different times. We run the test 50 times on a machine with an AMD EPYC 7B13 64-core CPU. For proof search, we use TPUv4 to serve multiple replicas per model and let different search trees within the same model query the same server under their own search strategies. |
| Software Dependencies | No | The new C++ library, which is exported into Python via pybind11 (Jakob et al., 2017), is over 300 times faster than DDAR1. |
| Experiment Setup | Yes | In contrast to AG1, we use top-k sampling with temperature t = 1.0 and k = 32. Note that a high temperature and multiple samples are essential for solving IMO problems. With greedy decoding (t = 0.0, k = 1) and no tree search, our models can solve only two of the 26 problems that require auxiliary constructions. Increasing the temperature to t = 1.0 and using k = 32 samples (without a search tree) allows our language models to solve 9 out of 26 problems. We train our models with the largest batch size allowed by the hardware, using TPUv4. The learning rate schedule is a linear warm-up followed by cosine annealing; learning rate hyperparameters are determined from scaling laws. For a single search tree, we find that the optimal configuration is a beam size of 128, a beam depth of 4, and 32 samples. |
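The greedy point-pruning routine quoted in the Pseudocode row can be made concrete. The following is a hedged, runnable sketch: the `Point` type is simplified to strings, and the reverse-topological ordering is passed in explicitly as a list, since the paper's `reverse_topological` helper is not shown.

```python
from typing import Callable, List, Set

def prune_points(
    points: Set[str],
    check_provable: Callable[[Set[str]], bool],
    reverse_topological_order: List[str],
) -> Set[str]:
    """Greedily remove points while the predicate stays provable."""
    pruned = set(points)
    for p in reverse_topological_order:
        # The predicate is monotonic: if a subset suffices to prove the
        # goal, any superset does too, so one greedy pass finds a
        # minimal satisfying set.
        if check_provable(pruned - {p}):
            pruned = pruned - {p}
    return pruned

# Toy usage: the goal is "provable" whenever points A and B survive.
points = {"A", "B", "C", "D"}
check = lambda s: {"A", "B"} <= s
assert prune_points(points, check, ["D", "C", "B", "A"]) == {"A", "B"}
```

Because the predicate is monotonic, removals never need to be revisited, which is why a single pass in reverse topological order suffices.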
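The Experiment Setup row reports top-k sampling with temperature (t = 1.0, k = 32) as the decoding strategy. The snippet below is a minimal illustrative sketch of that general technique, not AG2's actual decoder; the toy logits and vocabulary are assumptions.

```python
import math
import random

def top_k_sample(logits, k, temperature, rng):
    """Sample one token index using top-k sampling with temperature."""
    # Keep only the k highest-scoring candidates.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scale the surviving logits, then softmax-normalise.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one candidate proportionally to its renormalised probability.
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
# With k = 1 this degenerates to greedy decoding (always the argmax).
assert top_k_sample(logits, k=1, temperature=1.0, rng=rng) == 0
# With k = 3, the lowest-scoring candidate (index 3) can never be drawn.
samples = [top_k_sample(logits, k=3, temperature=1.0, rng=rng) for _ in range(200)]
assert 3 not in samples
```

This illustrates the paper's observation: k = 1 with t = 0 collapses to deterministic greedy decoding, while larger k and higher temperature diversify the 32 samples drawn per query.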