Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Bootstrapping Hierarchical Autoregressive Formal Reasoner with Chain-of-Proxy-Autoformalization

Authors: Qi Liu, Xinhao Zheng, Renqiu Xia, Qinxiang Cao, Junchi Yan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments demonstrate significant improvements: trained on data bootstrapped by CoPA, HAR achieves superior performance on FormalMath500 (15.50% → 44.09%) and MiniF2F-Solving (21.87% → 56.58%) with a lower computational budget. Explorations reveal promising directions in formal solution pruning and informal dataset denoising.
Researcher Affiliation	Academia	Qi Liu, Xinhao Zheng, Renqiu Xia, Qinxiang Cao, Junchi Yan; School of Computer Science & School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; EMAIL
Pseudocode	No	The paper describes the HAR and CoPA pipelines with figures (Fig. 1 and Fig. 2) and detailed textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	https://github.com/Purewhite2019/har_copa_main
Open Datasets	Yes	Relevant benchmarks (FormalMath500 [7], MiniF2F-Solving [7, 9], and PutnamBench-Solving [7, 60]), data (Numina-CoT [46], Numina-1.5 [46], and Lean-Workbook [46]), and base models (Qwen2.5-Math-7B [58], Qwen2.5-7B [65]) are released under the Apache 2.0 License. The original MATH [8] dataset, MathOdyssey [61], and Phi-4-mini-instruct [62] are released under the MIT license.
Dataset Splits	Yes	HAR and its baselines are evaluated on FormalMath500 [7] and MiniF2F-Solving [9, 7]. FormalMath500 is a formalized subset of MATH500 [48], consisting of 387 formal problems. MiniF2F-Solving is a refactored subset of MiniF2F [9], consisting of 375 formal problems.
Hardware Specification	Yes	Cycle 1 data generation requires 8 Ascend-910B NPUs and 192 Kunpeng-920 CPUs for over a month. Each fine-tuning run (problem autoformalizer, solution drafter, and proof searcher) requires 8 Ascend-910B NPUs and 192 Kunpeng-920 CPUs for about one week. Non-hierarchical experiments (BFS, WG, AR) require 1 Ascend-910B NPU and 64 Kunpeng-920 CPUs for two days; hierarchical experiments (H-BFS, H-WG, H-SA, HAR) require 2 Ascend-910B NPUs and 64 Kunpeng-920 CPUs for three days.
Software Dependencies	Yes	In this project, the Lean 4 environment relies on the following open-source projects: Lean 4 [43] v4.15.0; Mathlib 4 [50] v4.15.0; Aesop [44] v4.15.0; Pantograph [51] v0.2.25; Formal Problem-Solving [7] 39489d1f0c32b521845429e1cb26c48639d8f823. For LLM fine-tuning, we use XTuner [69] 081c8ca874bdbf7a7f8cd0a9e4cba503eaaa7bba with recipes detailed in Appendix F. For inference, we use vLLM [70] 0.6.0 with bfloat16 type and prefix caching enabled.
Experiment Setup	Yes	We use XTuner [69] for supervised fine-tuning (SFT) of Qwen2.5-Math-7B [58] using the dataset recipes above (all tasks are uniformly mixed for training) and the following hyperparameters: Max Sequence Length: 8192; Variable-length Attention: True; Pack to Maximal Length: True; Sequence Parallel Size: 1; Batch Size: 1; Gradient Accumulation: 64; Training Devices: 8; Train Epochs: 3; Optimizer: AdamW with learning rate 2 × 10⁻⁵, β = (0.9, 0.999), weight decay 0, maximal gradient norm 1, warmup ratio 0.03, and float16 mixed-precision training; Learning Rate Scheduler: warmup using LinearLR with start factor 10⁻⁵, then training using CosineAnnealingLR with η_min = 0.0.
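
The learning-rate schedule and batch configuration reported in the Experiment Setup row can be illustrated with a small, dependency-free sketch. It is not XTuner's implementation: the function names `effective_batch_size` and `lr_at_step` are hypothetical, the batch size of 1 is assumed to be per device, and the warmup length is assumed to be the warmup ratio times the total number of optimizer steps, rounded to the nearest step.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, devices):
    """Packed sequences consumed per optimizer step.

    Assumes the reported batch size is per device (an assumption, not
    stated in the table).
    """
    return per_device_batch * grad_accum * devices


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03,
               start_factor=1e-5, eta_min=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup scales base_lr by a factor rising from start_factor to
    1.0, followed by cosine annealing from base_lr down to eta_min.
    """
    # Assumed rounding rule for the warmup length (hypothetical).
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        frac = step / warmup_steps
        factor = start_factor + (1.0 - start_factor) * frac
        return base_lr * factor
    # Cosine phase: progress runs from 0 at the end of warmup to 1 at the end.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(effective_batch_size(1, 64, 8))
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at_step(s, total_steps=1000))
```

Under these assumptions, each optimizer step aggregates 1 × 64 × 8 = 512 packed sequences of up to 8192 tokens. The total of 1000 steps in the usage loop is an arbitrary illustration, since the table does not report the number of training steps.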