Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Atom of Thoughts for Markov LLM Test-Time Scaling

Authors: Fengwei Teng, Quan Shi, Zhaoyang Yu, Jiayi Zhang, Yuyu Luo, Chenglin Wu, Zhijiang Guo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that AOT consistently outperforms existing baselines as computational budgets increase. We conduct main experiments across a variety of datasets spanning mathematics, code generation, and multi-hop question answering to demonstrate the cost-efficiency advantages of AOT as a general-purpose reasoning framework. Table 1 presents the main experimental results. Figure 3 further demonstrates that performance improves progressively with additional reasoning iterations.
Researcher Affiliation	Collaboration	Fengwei Teng (1,2), Quan Shi (3), Zhaoyang Yu (2), Jiayi Zhang (1,2), Yuyu Luo (1), Chenglin Wu (2), Zhijiang Guo (1). Affiliations: (1) HKUST(GZ), (2) DeepWisdom, (3) Renmin University of China
Pseudocode	Yes	Listing 1: Math def direct(question: str): instruction = """You are a precise math question solver...""" Listing 2: Code def direct(question: str, contexts: str): instruction = """Solve the following problem step by step:...""" Listing 3: Multi-hop QA def direct(question: str, contexts: str): instruction = """Solve the following multi-hop question step by step:..."""
Open Source Code	Yes	We submit our code alongside this paper and will make it publicly available to facilitate reproducibility and future research. We have provided the code in the supplementary file.
Open Datasets	Yes	We evaluate AOT across representative benchmarks covering mathematical reasoning (MATH [14], GSM8K [8], AIME [footnote 1]), code generation (MBPP [1], LiveCodeBench [19]), and multi-hop question answering tasks (HotpotQA [39], MuSiQue [30], and 2WikiMultiHopQA [15] preprocessed by LongBench [2]). Footnote 1: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
Dataset Splits	No	The paper states: "For the MATH dataset, we filter out questions with non-integer or non-decimal answers to ensure consistent evaluation. We evaluate the first 1,000 cases from MATH for efficiency, while assessing the remaining benchmarks in their entirety." This describes data usage but not explicit training/test/validation splits for reproducibility.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running experiments. It mentions using LLMs as backbones but does not specify the hardware they were run on.
Software Dependencies	No	The paper refers to LLMs (GPT-4o-mini, DeepSeek-V3, O3-mini, DeepSeek-R1) and implicitly uses Python for prompt definitions, but it does not specify any particular software versions (e.g., Python version, library versions like PyTorch, TensorFlow, CUDA) that would be needed to replicate the experiments.
Experiment Setup	Yes	All prompt templates used in the Markov reasoning process for experiments are fully described in Appendix A.1. Key hyperparameters, including model temperature and Markov chain length, are detailed and discussed in Appendix A.2. We set the default temperature to 1.0 and the maximum Markov chain length to 3 for the main experiments to balance performance and efficiency while enabling scaling curves.
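
The MATH subset quoted in the Dataset Splits row (keep only questions with integer or decimal answers, then evaluate the first 1,000 cases) could be reproduced roughly as follows. This is a minimal sketch, not the authors' code: the function names, the "answer" field, the regular expression used to decide what counts as an integer or decimal, and the choice to filter before truncating are all assumptions, since the paper text quoted above does not specify them.

```python
import re

# Assumption: an answer is "integer or decimal" if it is an optionally
# signed number such as "42", "-3.5" or ".5". The exact rule is not given.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def is_numeric_answer(answer: str) -> bool:
    """Return True if the answer string is a plain integer or decimal."""
    return _NUMERIC.fullmatch(answer.strip()) is not None


def select_math_subset(records, limit=1000):
    """Filter records to numeric answers, then keep the first `limit`.

    Assumption: each record is a dict with an "answer" key (hypothetical
    field name), and filtering happens before truncation.
    """
    kept = [r for r in records if is_numeric_answer(r["answer"])]
    return kept[:limit]
```

Publishing the resulting list of question identifiers alongside the code would remove the remaining ambiguity about which cases were evaluated.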