Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
Authors: Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang H. Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le, Thang Luong
JMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AG2 achieves a new state-of-the-art solving rate of 84% on all IMO geometry problems from 2000 to 2024, compared to 54% achieved in AG1. This demonstrates a significant leap forward in AI's ability to tackle challenging mathematical reasoning tasks, surpassing an average IMO gold medalist. We also run ablation studies on how inference settings affect the overall performance (see Figure 9). |
| Researcher Affiliation | Collaboration | Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang H. Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le, Thang Luong (emails redacted). Equal contributions. Affiliations: Google DeepMind; University of Cambridge; Georgia Institute of Technology; Brown University; Seoul National University. |
| Pseudocode | Yes | `def prune_points(points: set[Point], check_provable: Callable[[set[Point]], bool]): pruned = set(points); for p in reverse_topological(points): if check_provable(pruned - {p}): pruned = pruned - {p}; return pruned` — Figure 3: Basic greedy algorithm to find a minimal set of points satisfying a monotonic predicate check. |
| Open Source Code | Yes | Code: https://github.com/google-deepmind/alphageometry2. Code for the Python implementation of the symbolic engine (DDAR2), along with multiple examples of proven IMO problems, will be shared in that repository. |
| Open Datasets | Yes | We share 27 IMO problems translated into the AlphaGeometry language along with their diagrams and solutions. They can be found in the file test.py within the provided repository. |
| Dataset Splits | Yes | Apart from a large synthetic training set of around 300 million theorems, we create three evaluation sets: 1. a synthetic problem set with and without auxiliary points, *eval*; 2. a synthetic problem set with only auxiliary points, *eval_aux*; 3. a special set of geometry problems from IMO 2000-2024 that have been solved by AlphaGeometry previously, *imo_eval*. |
| Hardware Specification | Yes | While on average DDAR1 finishes its computations in 1179.57 ± 8.055 seconds, DDAR2 is much faster, finishing in 3.44711 ± 0.05476 seconds. The average running time may vary depending on the machine status at different times. We run the test 50 times on a machine with an AMD EPYC 7B13 64-core CPU. For proof search, we use TPUv4 to serve multiple replicas per model and let different search trees within the same model query the same server under their own search strategies. |
| Software Dependencies | No | The new C++ library, which is exported into Python via pybind11 (Jakob et al., 2017), is over 300 times faster than DDAR1. |
| Experiment Setup | Yes | In contrast to AG1, we use top-k sampling with temperature t = 1.0 and k = 32. Note that a high temperature and multiple samples are essential for solving IMO problems. With greedy decoding (t = 0.0, k = 1) and no tree search, our models can solve only two of the 26 problems that require auxiliary constructions. Increasing the temperature to t = 1.0 and using k = 32 samples (without a search tree) allows our language models to solve 9 out of 26 problems. We train our models with the largest batch size allowed by the hardware, using TPUv4. The learning rate schedule is a linear warm-up followed by cosine annealing; learning rate hyperparameters are determined from scaling laws. For a single search tree, we find that the optimal configuration is a beam size of 128, a beam depth of 4, and 32 samples. |
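The greedy point-pruning routine quoted in the Pseudocode row can be made concrete. The following is a hedged, runnable sketch: the `Point` type is simplified to strings, and the reverse-topological ordering is passed in explicitly as a list, since the paper's `reverse_topological` helper is not shown.

```python
from typing import Callable, List, Set

def prune_points(
    points: Set[str],
    check_provable: Callable[[Set[str]], bool],
    reverse_topological_order: List[str],
) -> Set[str]:
    """Greedily remove points while the predicate stays provable."""
    pruned = set(points)
    for p in reverse_topological_order:
        # The predicate is monotonic: if a subset suffices to prove the
        # goal, any superset does too, so one greedy pass finds a
        # minimal satisfying set.
        if check_provable(pruned - {p}):
            pruned = pruned - {p}
    return pruned

# Toy usage: the goal is "provable" whenever points A and B survive.
points = {"A", "B", "C", "D"}
check = lambda s: {"A", "B"} <= s
assert prune_points(points, check, ["D", "C", "B", "A"]) == {"A", "B"}
```

Because the predicate is monotonic, removals never need to be revisited, which is why a single pass in reverse topological order suffices.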
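The Experiment Setup row reports top-k sampling with temperature (t = 1.0, k = 32) as the decoding strategy. The snippet below is a minimal illustrative sketch of that general technique, not AG2's actual decoder; the toy logits and vocabulary are assumptions.

```python
import math
import random

def top_k_sample(logits, k, temperature, rng):
    """Sample one token index using top-k sampling with temperature."""
    # Keep only the k highest-scoring candidates.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scale the surviving logits, then softmax-normalise.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one candidate proportionally to its renormalised probability.
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
# With k = 1 this degenerates to greedy decoding (always the argmax).
assert top_k_sample(logits, k=1, temperature=1.0, rng=rng) == 0
# With k = 3, the lowest-scoring candidate (index 3) can never be drawn.
samples = [top_k_sample(logits, k=3, temperature=1.0, rng=rng) for _ in range(200)]
assert 3 not in samples
```

This illustrates the paper's observation: k = 1 with t = 0 collapses to deterministic greedy decoding, while larger k and higher temperature diversify the 32 samples drawn per query.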