Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

What Makes In-context Learning Effective for Mathematical Reasoning

Authors: Jiayu Liu, Zhenya Huang, Chaokun Wang, Xunpeng Huang, Chengxiang Zhai, Enhong Chen

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that our LMS3 has superiority and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish. |
| Researcher Affiliation | Academia | 1 State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 Hong Kong University of Science and Technology; 4 Department of Computer Science, University of Illinois at Urbana-Champaign. Correspondence to: Zhenya Huang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Our LMS3. Input: k-shot, X_test, D, λ. Output: selected demonstration set D_k ⊆ D. 1: Calculate Sim(X), Stab(X), Score(X) for X ∈ D based on Eqs. (21), (22), (24). 2: Define Score_k ⊆ D as the set of k samples with the smallest Score(X) values. 3: Define Sim_λ ⊆ D as the set of λ samples with the smallest Sim(X) values. 4: D_k = {}. 5: for X ∈ Score_k do 6: if X ∉ Sim_λ do 7: D_k = D_k ∪ {X}; |
| Open Source Code | Yes | Our code is available at https://github.com/Ljyustc/LMS3. |
| Open Datasets | Yes | We use three datasets that cover a variety of types and difficulty levels. MAWPS (Koncel-Kedziorski et al., 2016) consists of 2,373 elementary-level math word problems. GSM8K (Cobbe et al., 2021) is composed of 8,792 more challenging elementary problems with a higher number of steps. MATH (Hendrycks et al., 2021) is collected from high school math competitions, containing 12,500 problems across seven categories, and is currently one of the most widely used benchmarks. [...] we include an additional experiment on CommonsenseQA, a large-scale benchmark designed to evaluate commonsense reasoning tasks that has been widely used in ICL research (Ye et al., 2023; Qin et al., 2023; Min et al., 2022). |
| Dataset Splits | Yes | For GSM8K and MATH, we follow their publicly available train/test splits as D and D_test. For MAWPS, we randomly split the dataset into an 8:2 ratio for D/D_test. We summarize the dataset statistics in Table 6. For each dataset, we also randomly select 200 problems from D as the validation set to support the implementation of some baselines. |
| Hardware Specification | Yes | All experiments are conducted on a server with six NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper names 'Llama2-13B' and 'Llama3-8B' as backbones but does not give version numbers for ancillary software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | When implementing our LMS3, we set λ to 10% for Llama2-13B and 1% for Llama3-8B. The temperature for both LLMs is set to 0.8. For the baseline Influence, the size S of the subset is set to 20. For the baseline IDS, the number Q of iterations is set to 3. |
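The Algorithm 1 pseudocode quoted above can be sketched as a short, runnable routine. This is a hypothetical reading, not the authors' implementation: it assumes the set symbols lost in extraction are ∈/∉/∪, that line 6 is the paper's demonstration-rejection step (X ∉ Sim_λ), and that λ is supplied as a fraction of |D| (the Experiment Setup row reports 10% and 1%). The function name and signature are illustrative.

```python
def select_demonstrations(scores, sims, k, lam_frac):
    """Sketch of LMS3 Algorithm 1: keep the k samples with the smallest
    Score(X), rejecting any that fall among the lam_frac fraction of
    samples with the smallest Sim(X) (the rejection set Sim_lambda)."""
    n = len(scores)
    lam = max(1, int(lam_frac * n))  # |Sim_lambda|; lam_frac e.g. 0.10 or 0.01
    # Score_k: indices of the k smallest Score(X) values.
    score_k = sorted(range(n), key=lambda i: scores[i])[:k]
    # Sim_lambda: indices of the lam smallest Sim(X) values.
    sim_lam = set(sorted(range(n), key=lambda i: sims[i])[:lam])
    # D_k: low-score candidates that survive the rejection filter.
    return [i for i in score_k if i not in sim_lam]
```

As in the pseudocode, the result may contain fewer than k demonstrations when candidates are rejected; the paper's Eqs. (21), (22), (24) would supply the actual Sim and Score values.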