Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings

Authors: Xiaoang Xu, Shuo Wang, Xu Han, Zhenghao Liu, Huijia Wu, Peipei Li, Zhiyuan Liu, Maosong Sun, Zhaofeng He

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39× with low-budget and reduce the length of the output token by nearly 50% with high-budget.
Researcher Affiliation	Academia	(1) Beijing University of Posts and Telecommunications; (2) Dept. of Comp. Sci. & Tech., Tsinghua University, Beijing, China; (3) Northeastern University; (4) Institute for AI, Tsinghua University, Beijing, China; (5) Beijing National Research Center for Information Science and Technology
Pseudocode	Yes	Algorithm 1: A*-Thought algorithm for compressing lengthy CoTs
Open Source Code	Yes	The code can be accessed at: https://github.com/AI9Stars/AStar-Thought.
Open Datasets	Yes	Benchmarks: We employ the following mathematical reasoning tasks in our experiments, all of which demand complex reasoning capabilities from LRMs: MATH500 (Lightman et al., 2023), AMC23 (AMC, 2025), OlympiadBench (He et al., 2024), and GSM8K (Cobbe et al., 2021). Training Data and Verification Model: We utilize the long CoT data released by Muennighoff et al. (2025) as the original CoT data and employ the corresponding distilled model, s1.1-32B, as the verification model, following the approach detailed in Section 3.2.
Dataset Splits	No	The paper mentions using long CoT data (Muennighoff et al., 2025) for training and evaluates on specific benchmarks (MATH500, AMC23, OlympiadBench, GSM8K), but it does not explicitly provide the training/validation/test splits used for its own experiments or for the 's1K-1.1' dataset.
Hardware Specification	Yes	Training was conducted on 8 NVIDIA A100 80G GPUs, using a per-GPU batch size of 1 and 8 gradient accumulation steps.
Software Dependencies	No	The paper mentions using a compact language model, specifically GPT-2, for importance estimation, but does not provide specific version numbers for any libraries, frameworks, or other ancillary software dependencies used for implementation.
Experiment Setup	Yes	Training Details: We trained all models, including training-based baselines and our proposed method, for 3 epochs with a peak learning rate of 1 × 10⁻⁵ and a warm-up ratio of 0.1. Training was conducted on 8 NVIDIA A100 80G GPUs, using a per-GPU batch size of 1 and 8 gradient accumulation steps. For our proposed method, the default hyperparameters were set as α = 0.5 (Eq. 4) and β = 0.1 (Eq. 8). The lower bound for the verification depth, k_min, is set to 5, while the upper bound for the search tree depth, k_max, is set to 20. The exploration size W was set to 2.
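
The reported setup can be collected into a small configuration object, which also makes the implied effective batch size explicit. This is a minimal sketch, not the authors' code: the class name TrainingSetup and all field names are hypothetical, the learning rate assumes the reported value reads as 1e-5, and the batch-size arithmetic assumes plain data parallelism across the 8 GPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Values from the Experiment Setup row; field names are hypothetical."""

    epochs: int = 3
    peak_learning_rate: float = 1e-5  # assumed reading of the reported rate
    warmup_ratio: float = 0.1
    num_gpus: int = 8
    per_gpu_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    alpha: float = 0.5          # default for Eq. 4 in the paper
    beta: float = 0.1           # default for Eq. 8 in the paper
    k_min: int = 5              # lower bound for the verification depth
    k_max: int = 20             # upper bound for the search tree depth
    exploration_size: int = 2   # W

    def effective_batch_size(self) -> int:
        # Sequences per optimizer step, assuming plain data parallelism.
        return (
            self.num_gpus
            * self.per_gpu_batch_size
            * self.gradient_accumulation_steps
        )


setup = TrainingSetup()
print(setup.effective_batch_size())  # 8 * 1 * 8 = 64
```

Under these assumptions, each optimizer step sees 64 sequences. The paper does not state this figure directly; it follows only from multiplying the three reported values.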