Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

What Makes In-context Learning Effective for Mathematical Reasoning

Authors: Jiayu Liu, Zhenya Huang, Chaokun Wang, Xunpeng Huang, Chengxiang Zhai, Enhong Chen

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that our LMS3 has superiority and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish. |
| Researcher Affiliation | Academia | 1 State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 Hong Kong University of Science and Technology; 4 Department of Computer Science, University of Illinois at Urbana-Champaign. Correspondence to: Zhenya Huang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Our LMS3. Input: k-shot, X_test, D, λ. Output: selected demonstration set D_k ⊆ D. 1: Calculate Sim(X), Stab(X), Score(X) for X ∈ D based on Eqs. (21), (22), (24). 2: Define Score_k ⊆ D as the set of k samples with the smallest Score(X) values. 3: Define Sim_λ ⊆ D as the set of λ samples with the smallest Sim(X) values. 4: D_k = {}. 5: for X ∈ Score_k do 6: if X ∉ Sim_λ do 7: D_k = D_k ∪ {X}; |
| Open Source Code | Yes | Our code is available at https://github.com/Ljyustc/LMS3. |
| Open Datasets | Yes | We use three datasets that cover a variety of types and difficulty levels. MAWPS (Koncel-Kedziorski et al., 2016) consists of 2,373 elementary-level math word problems. GSM8K (Cobbe et al., 2021) is composed of 8,792 more challenging elementary problems with a higher number of steps. MATH (Hendrycks et al., 2021) is collected from high school math competitions, containing 12,500 problems across seven categories, and is currently one of the most widely used benchmarks. [...] we include an additional experiment on CommonsenseQA, a large-scale benchmark designed to evaluate commonsense reasoning tasks that has been widely used in ICL research (Ye et al., 2023; Qin et al., 2023; Min et al., 2022). |
| Dataset Splits | Yes | For GSM8K and MATH, we follow their publicly available train/test splits as D and D_test. For MAWPS, we randomly split the dataset into an 8:2 ratio for D/D_test. We summarize the dataset statistics in Table 6. For each dataset, we also randomly select 200 problems from D as the validation set to support the implementation of some baselines. |
| Hardware Specification | Yes | All experiments are conducted on a server with six NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper names 'Llama2-13B' and 'Llama3-8B' as backbones but does not give version numbers for ancillary software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | When implementing our LMS3, we set λ to 10% for Llama2-13B and 1% for Llama3-8B. The temperature for both LLMs is set to 0.8. For the baseline Influence, the size S of the subset is set to 20. For the baseline IDS, the number Q of iterations is set to 3. |
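The Algorithm 1 pseudocode quoted above can be sketched as a short, runnable routine. This is a hypothetical reading, not the authors' implementation: it assumes the set symbols lost in extraction are ∈/∉/∪, that line 6 is the paper's demonstration-rejection step (X ∉ Sim_λ), and that λ is supplied as a fraction of |D| (the Experiment Setup row reports 10% and 1%). The function name and signature are illustrative.

```python
def select_demonstrations(scores, sims, k, lam_frac):
    """Sketch of LMS3 Algorithm 1: keep the k samples with the smallest
    Score(X), rejecting any that fall among the lam_frac fraction of
    samples with the smallest Sim(X) (the rejection set Sim_lambda)."""
    n = len(scores)
    lam = max(1, int(lam_frac * n))  # |Sim_lambda|; lam_frac e.g. 0.10 or 0.01
    # Score_k: indices of the k smallest Score(X) values.
    score_k = sorted(range(n), key=lambda i: scores[i])[:k]
    # Sim_lambda: indices of the lam smallest Sim(X) values.
    sim_lam = set(sorted(range(n), key=lambda i: sims[i])[:lam])
    # D_k: low-score candidates that survive the rejection filter.
    return [i for i in score_k if i not in sim_lam]
```

As in the pseudocode, the result may contain fewer than k demonstrations when candidates are rejected; the paper's Eqs. (21), (22), (24) would supply the actual Sim and Score values.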