Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
What Makes In-context Learning Effective for Mathematical Reasoning
Authors: Jiayu Liu, Zhenya Huang, Chaokun Wang, Xunpeng Huang, Chengxiang Zhai, Enhong Chen
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that our LMS3 has superiority and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish. |
| Researcher Affiliation | Academia | 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center 3Hong Kong University of Science and Technology 4Department of Computer Science, University of Illinois at Urbana-Champaign. Correspondence to: Zhenya Huang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Our LMS3. Input: k-shot, Xtest, D, λ. Output: Selected demonstration set Dk ⊆ D. 1: Calculate Sim(X), Stab(X), Score(X) for X ∈ D based on Eqs. (21)(22)(24). 2: Define Score_k ⊆ D as the set of k samples with the smallest Score(X) values. 3: Define Sim_λ ⊆ D as the set of λ samples with the smallest Sim(X) values. 4: Dk = {}. 5: for X ∈ Score_k do 6: if X ∉ Sim_λ do 7: Dk = Dk ∪ {X}; |
| Open Source Code | Yes | Our code is available at https://github.com/Ljyustc/LMS3. |
| Open Datasets | Yes | We use three datasets that cover a variety of types and difficulty levels. MAWPS (Koncel-Kedziorski et al., 2016) consists of 2,373 elementary-level math word problems. GSM8K (Cobbe et al., 2021) is composed of 8,792 more challenging elementary problems with a higher number of steps. MATH (Hendrycks et al., 2021) is collected from high school math competitions, containing 12,500 problems across seven categories, and is currently one of the most widely used benchmarks. [...] we include an additional experiment on Commonsense QA, a large-scale benchmark designed to evaluate commonsense reasoning tasks that has been widely used in ICL research (Ye et al., 2023; Qin et al., 2023; Min et al., 2022). |
| Dataset Splits | Yes | For GSM8K and MATH, we follow their publicly available train/test splits as D and Dtest. For MAWPS, we randomly split the dataset into an 8:2 ratio for D/Dtest. We summarize the dataset statistics in Table 6. For each dataset, we also randomly select 200 problems from D as the validation set to support the implementation of some baselines. |
| Hardware Specification | Yes | All experiments are conducted on a server with six NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using 'Llama2-13B' and 'Llama3-8B' as backbones, but does not provide specific version numbers for any ancillary software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | When implementing our LMS3, we set λ to 10% for Llama2-13B and 1% for Llama3-8B. The temperature for both LLMs is set to 0.8. For the baseline Influence, the size S of the subset is set to 20. For the baseline IDS, the number Q of iterations is set to 3. |
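The selection procedure quoted in the Pseudocode row (Algorithm 1) can be sketched in plain Python. This is a hypothetical illustration, not the authors' implementation: it assumes `Score(X)` and `Sim(X)` have already been computed per Eqs. (21)(22)(24) of the paper (not reproduced here), and it treats λ as a fraction of the candidate pool, consistent with the 10%/1% settings reported in the Experiment Setup row. The function name and toy values are invented for the example.

```python
def select_demonstrations(scores, sims, k, lam_fraction):
    """Pick up to k candidates with the smallest Score(X), skipping any
    candidate whose Sim(X) falls among the lam_fraction smallest."""
    candidates = list(scores)
    # Score_k: the k samples with the smallest Score(X) values.
    score_k = sorted(candidates, key=lambda x: scores[x])[:k]
    # Sim_lambda: the lam_fraction of samples with the smallest Sim(X).
    n_reject = int(len(candidates) * lam_fraction)
    sim_lambda = set(sorted(candidates, key=lambda x: sims[x])[:n_reject])
    # Keep X from Score_k only if it is not rejected by Sim_lambda.
    return [x for x in score_k if x not in sim_lambda]

# Toy usage with made-up values: "a" has the best score but the worst
# similarity, so it is rejected and only "c" survives.
scores = {"a": 0.1, "b": 0.5, "c": 0.2, "d": 0.9}
sims = {"a": 0.05, "b": 0.8, "c": 0.6, "d": 0.3}
print(select_demonstrations(scores, sims, k=2, lam_fraction=0.25))  # ['c']
```

Note that the rejection step means fewer than k demonstrations may be returned, which matches the "Dk = Dk ∪ {X}" accumulation in the quoted pseudocode rather than a fixed-size top-k selection.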