Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data

Authors: Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, Tao Luo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Running the proposed ATLAS framework for 10 iterations, we construct an undergraduate-level dataset of 117k theorem statements and develop the ATLAS Translator by fine-tuning Llama3.1-8B-Instruct with Lo RA. This model establishes a new state of the art, demonstrating statistically significant improvements over both the Herald Translator and the Kimina-Autoformalizer across all benchmarks (p < 0.05, two-sided t-test).
Researcher Affiliation	Collaboration	1 School of Mathematical Sciences, Shanghai Jiao Tong University 2 Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University 3 SPEIT, Shanghai Jiao Tong University 4 Join Tech Co., Ltd EMAIL
Pseudocode	No	The paper includes 'Figure 1: The overview of the proposed ATLAS framework.' which is a diagram illustrating the process, and methodological steps are described in text, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	The datasets, model, and code are available at https://github.com/Xiaoyang Liu-sjtu/ATLAS.
Open Datasets	Yes	The datasets, model, and code are available at https://github.com/Xiaoyang Liu-sjtu/ATLAS. (...) We introduce the ATLAS dataset and the Math Qual dataset. The former comprises 117k undergraduate-level parallel statements, making it one of the largest available.
Dataset Splits	No	The paper describes the creation of the ATLAS dataset for training and references external benchmarks (Proof Net, Putnam Bench, Math Qual) for evaluation. However, it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the ATLAS dataset itself, stating only that Llama3.1-8B-Instruct is 're-trained on the data generated during expert iteration (referred to as the ATLAS dataset)'.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA A100 GPU with 40GB of memory.
Software Dependencies	Yes	We employ Deep Seek-V2.5 [22] as the teacher model and Llama3.1-8B-Instruct [8] as the student model, for which we utilize nucleus sampling [12] with top p=0.9 and a temperature of 0.6 for generation. (...) All three stages are fine-tuned using Lo RA [13] with LLa MA-Factory [57] for 3 epochs (...) Intern LM2-Math-Plus-7B [52] (...) Qwen2.5 [49]
Experiment Setup	Yes	We employ Deep Seek-V2.5 [22] as the teacher model and Llama3.1-8B-Instruct [8] as the student model, for which we utilize nucleus sampling [12] with top p=0.9 and a temperature of 0.6 for generation. (...) All three stages are fine-tuned using Lo RA [13] with LLa MA-Factory [57] for 3 epochs, with a total batch size of 128 and a learning rate of 1.0e-5 with a cosine decay schedule. (...) Deep Seek-V3 is set with a sampling temperature of 0.7