Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
Authors: Reiss Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. |
| Researcher Affiliation | Collaboration | Woosung Koh¹,³, Wonbeen Oh³, Jaein Jang³, Min Hyung Lee³, Hyeongjin Kim³, Ah Yeon Kim³, Joonkee Kim², Junghyun Lee¹, Taehyeon Kim², Se-Young Yun¹. ¹KAIST AI, ²LG AI Research, ³Yonsei University |
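| Pseudocode | Yes | Algorithm 1: AdaSTaR. Input: $D$, $\pi_\theta^{t=1}$, $e$. /* AdaD (§3.1; lines 1-14) */ 1: $t \leftarrow \mathrm{dict}\{i : t_i = 0\}_{i=1}^{N}$; 2: $w \leftarrow \mathrm{dict}\{i : w_i = 0\}_{i=1}^{N}$; 3: init HieMinHeap$(D, t, w)$; 4: for iteration $t = 1, \ldots$ do 5: $D_+^t \leftarrow \emptyset$, $m \leftarrow 0$; 6: $w^{\mathrm{tmp}} \leftarrow \mathrm{dict}\{i : w_i^{\mathrm{tmp}} = 0\}_{i=1}^{N}$; 7: while $\lvert D_+^t \rvert < \beta^t$ do 8: $i \leftarrow$ HieMinHeap.peek_next; 9: $m \leftarrow m + 1$; 10: for sample $k = 1, \ldots, K$ do 11: $\hat{c}_i, \hat{y}_i \leftarrow \pi_\theta^t(e, x_i)$; 12: $w_i^{\mathrm{tmp}} \leftarrow \frac{k-1}{k} w_i^{\mathrm{tmp}} + \frac{1}{k} \mathbb{I}[\hat{y}_i = y_i]$; 13: if $\hat{y}_i = y_i$ then 14: $D_+^t \leftarrow D_+^t \cup \{\langle x_i, \hat{c}_i, \hat{y}_i \rangle\}$; /* AdaC (§3.2; lines 15-19) */ 15: $\alpha, \pi_\theta^{t+1} \leftarrow \mathrm{Train}(\pi_\theta^t, D_+^t)$; 16: for $1, \ldots, m\alpha^2$ do 17: $i \leftarrow$ HieMinHeap.pop; 18: $t_i \leftarrow t$, $w_i \leftarrow w_i^{\mathrm{tmp}}$; 19: HieMinHeap.push$(i, t_i, w_i)$; |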
| Open Source Code | Yes | github.com/reiss-koh/AdaSTaR |
| Open Datasets | Yes | Datasets. We attempt to get a wide coverage of reasoning tasks by using six well-known datasets. We use the AI2 Reasoning Challenge's Challenge set (ARC-C; Clark et al., 2018) for scientific reasoning, CommonsenseQA (CQA; Talmor et al., 2019) for commonsense reasoning, and CLadder 1.5 (Jin et al., 2023) for causal reasoning. For natural language inference reasoning we use Adversarial NLI (ANLI; Nie et al., 2020). For mathematical reasoning we use GSM8K (Cobbe et al., 2021) and SVAMP (Patel et al., 2021). |
| Dataset Splits | Yes | Table 4: Train and test set sizes for each dataset (train / test). ARC-C: 1,418 / 1,172; CQA: 9,741 / 1,140; CLadder 1.5: 8,089 / 2,023; ANLI (R1): 10,000 / 1,000; GSM8K: 7,473 / 1,319; SVAMP: 800 / 300. |
| Hardware Specification | Yes | We primarily conduct our experiments on numerous nodes with 8 RTX 3090 24G, with equivalent hardware specifications across nodes. For a few compute heavy experiments we use nodes with 8 A100 40G. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for key software components such as programming languages, libraries, or frameworks. It mentions 'Adam' as the optimizer and 'bf16' for model precision, but these are types rather than specific versioned software dependencies. |
| Experiment Setup | Yes | Table 3: Hyperparameters across datasets. The following values are identical for ARC-C, CQA, CLadder 1.5, ANLI, GSM8K, and SVAMP: batch size 8; learning rate $10^{-5}$; weight decay 0.01; warm-up steps 100; optimizer Adam; model precision bf16; samples for self-consistency 5; inference decoding temperature 1.0; evaluation decoding temperature 0. Rationalization (default): True for ARC-C, CQA, CLadder 1.5, and ANLI; False for GSM8K and SVAMP. |
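
The Algorithm 1 pseudocode quoted in the Pseudocode row can be read as the following minimal Python sketch of a single iteration. It is an illustration under stated assumptions, not the authors' implementation: `generate` and `train` are hypothetical callables standing in for the model $\pi_\theta^t$ and the Train step, a flat `heapq` list of `(t_i, w_i, i)` tuples stands in for HieMinHeap, $m\alpha^2$ is truncated to an integer, and the sampling loop stops early if the data runs out.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]            # (question x_i, gold answer y_i)
HeapEntry = Tuple[int, float, int]  # (last-sampled iteration t_i, win rate w_i, index i)


def adastar_iteration(
    data: List[Sample],
    heap: List[HeapEntry],                       # heapified list; stand-in for HieMinHeap
    generate: Callable[[str], Tuple[str, str]],  # hypothetical: x -> (rationale, answer)
    train: Callable[[list], float],              # hypothetical: trains, returns alpha
    t: int,
    beta: int,
    K: int,
) -> list:
    """One outer-loop iteration (lines 5-19 of the quoted pseudocode)."""
    d_plus: list = []
    w_tmp: Dict[int, float] = {}
    m = 0
    order = sorted(heap)  # peek order: smallest (t_i, w_i) first, nothing is popped yet

    # Lines 7-14: sample until beta correct rationales are collected.
    # The `m < len(order)` guard is an added assumption to avoid looping forever.
    while len(d_plus) < beta and m < len(order):
        _, _, i = order[m]
        m += 1
        x, y = data[i]
        w = 0.0
        for k in range(1, K + 1):
            c_hat, y_hat = generate(x)
            correct = y_hat == y
            # Line 12: running mean of correctness over the K samples.
            w = (k - 1) / k * w + (1.0 if correct else 0.0) / k
            if correct:
                d_plus.append((x, c_hat, y_hat))
        w_tmp[i] = w

    # Line 15: train on the collected samples and obtain alpha.
    alpha = train(d_plus)

    # Lines 16-19: refresh statistics for the first m * alpha^2 visited samples.
    n_update = int(m * alpha ** 2)  # integer truncation is an assumption
    popped = [heapq.heappop(heap) for _ in range(n_update)]
    for _, _, i in popped:
        heapq.heappush(heap, (t, w_tmp[i], i))
    return d_plus
```

With `heap = [(0, 0.0, i) for i in range(len(data))]` passed through `heapq.heapify`, repeated calls with increasing `t` move recently updated samples to the back of the queue, so samples that were visited but not refreshed are revisited first.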