Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

Authors: Shengyu Feng, Xiang Kong, shuang ma, Aonan Zhang, Dong Yin, Chong Wang, Ruoming Pang, Yiming Yang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the advantages of our method across multiple math benchmarks, and also validate our theoretical analysis of both our approach and existing verification methods.
Researcher Affiliation | Collaboration | (1) Language Technologies Institute, Carnegie Mellon University; (2) Apple
Pseudocode | Yes | Appendix B, Pseudocode for TSMC: Here we summarize the pseudocode for our TSMC-based verification method, where CONCAT(·) represents the concatenation function, i.e., appending a new element to a list. (Algorithm 1: TSMC for Verification)
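The algorithm itself is not reproduced in this record, but the generic twisted-SMC loop the quote describes can be sketched as follows. `extend_step` and `twist` are hypothetical stand-ins for the paper's step-level proposal (the generator) and learned twist/value function; this is a minimal illustration, not the authors' implementation.

```python
import random

def tsmc_verify(initial, extend_step, twist, num_particles=4, num_steps=3):
    """Sketch of twisted SMC over partial solutions.

    extend_step(partial) -> next reasoning step (hypothetical helper);
    twist(partial) -> positive score for a partial solution, standing in
    for the learned value function that reweights particles each step.
    """
    particles = [list(initial) for _ in range(num_particles)]
    for _ in range(num_steps):
        # Propose: extend each partial solution by one reasoning step
        # (CONCAT in the paper's notation appends the step to the list).
        particles = [p + [extend_step(p)] for p in particles]
        # Weight each particle by the twist (value) function.
        weights = [twist(p) for p in particles]
        # Resample particles in proportion to their weights.
        particles = random.choices(particles, weights=weights, k=num_particles)
    return particles
```

The resampling step is what concentrates computation on promising partial solutions, which is the core difference from scoring only completed solutions.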
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the authors have provided open-source code for the methodology described.
Open Datasets | Yes | Datasets. Building on prior work (Uesato et al., 2022; Lightman et al., 2024; Wang et al., 2023a), we assess our TSMC method using two widely used math datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
Dataset Splits | Yes | For GSM8K, we evaluate model performance on all testing instances. While for MATH, we follow Lightman et al. (2024) to select a representative subset of 500 testing instances, referred to as MATH500 in the following text. In this section, we demonstrate the generalizability of TSMC to other reasoning tasks beyond mathematical problems. Here we choose the quantitative natural language inference task in the NumGLUE benchmark (Mishra et al., 2022), which uses a Python program for multi-step reasoning. ...This forms a final dataset with 5924 training samples, 200 validation samples, and 200 testing samples.
Hardware Specification | Yes | We conducted our experiments for Llemma-7B on 8 NVIDIA H100 GPUs, and our experiments for DeepSeek-7B on 4 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper lists training hyperparameters in Table 2, such as learning rate, batch size, number of epochs, warmup ratio, maximum length, and dtype (BF16), but does not specify software dependencies like programming language or library versions (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Table 2: The summary of training hyperparameters for all models. During the inference time, we generate the solution using top-K sampling with K = 20 and set the temperature as 0.7. The maximum length of the solution is fixed as 768. We apply CTL loss (Zhao et al., 2024) on the step-level. The steps are separated by double newline indicators, that is, \n\n, and the value function is trained on the token corresponding to the second newline indicator, along with the end-of-sentence token <eos>.
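The quoted decoding setup (top-K sampling with K = 20, temperature 0.7) and the double-newline step segmentation can be sketched as below. The function names are illustrative, not from the paper, and the sampler is a generic textbook implementation rather than the authors' code.

```python
import math
import random

def top_k_sample(logits, k=20, temperature=0.7):
    """Sketch of top-K sampling: keep the K highest-logit token
    indices, rescale their logits by the temperature, and sample one."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(top, weights=weights)[0]

def split_steps(solution):
    """Split a generated solution into reasoning steps on the
    double-newline separator described in the quoted setup."""
    return [s for s in solution.split("\n\n") if s.strip()]
```

Lower temperatures (like the 0.7 used here) sharpen the distribution over the K retained tokens, trading diversity for higher-probability continuations.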