Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Authors: Reiss Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines.
Researcher Affiliation	Collaboration	Woosung Koh (1,3), Wonbeen Oh (3), Jaein Jang (3), Min Hyung Lee (3), Hyeongjin Kim (3), Ah Yeon Kim (3), Joonkee Kim (2), Junghyun Lee (1), Taehyeon Kim (2), Se-Young Yun (1). Affiliations: (1) KAIST AI, (2) LG AI Research, (3) Yonsei University.
Pseudocode	Yes	Algorithm 1: AdaSTaR. Input: D, π_θ^{t=1}, e. /* AdaD (§3.1; lines 1-14) */ 1: t ← dict{i : t_i = 0}_{i=1}^{N}; 2: w ← dict{i : w_i = 0}_{i=1}^{N}; 3: init HieMinHeap(D, t, w); 4: for iteration t = 1, … do 5: D_+^t ← ∅, m ← 0; 6: w^tmp ← dict{i : w_i^tmp = 0}_{i=1}^{N}; 7: while |D_+^t| < β^t do 8: i ← HieMinHeap.peek_next; 9: m ← m + 1; 10: for sample k = 1, …, K do 11: ĉ_i, ŷ_i ← π_θ^t(e, x_i); 12: w_i^tmp ← ((k − 1)/k) w_i^tmp + (1/k) I[ŷ_i = y_i]; 13: if ŷ_i = y_i then 14: D_+^t ← D_+^t ∪ {x_i, ĉ_i, ŷ_i}; /* AdaC (§3.2; lines 15-19) */ 15: α, π_θ^{t+1} ← Train(π_θ^t, D_+^t); 16: for 1, …, mα^2 do 17: i ← HieMinHeap.pop; 18: t_i ← t, w_i ← w_i^tmp; 19: HieMinHeap.push(i, t_i, w_i);
Open Source Code	Yes	github.com/reiss-koh/AdaSTaR
Open Datasets	Yes	Datasets. We attempt to get a wide coverage of reasoning tasks by using six well-known datasets. We use the AI2 Reasoning Challenge's Challenge set (ARC-C; Clark et al., 2018) for scientific reasoning, CommonsenseQA (CQA; Talmor et al., 2019) for commonsense reasoning, and CLadder 1.5 (Jin et al., 2023) for causal reasoning. For natural language inference reasoning we use Adversarial NLI (ANLI; Nie et al., 2020). For mathematical reasoning we use GSM8K (Cobbe et al., 2021) and SVAMP (Patel et al., 2021).
Dataset Splits	Yes	Table 4: Train and test set sizes for each dataset (train / test). ARC-C: 1,418 / 1,172; CQA: 9,741 / 1,140; CLadder 1.5: 8,089 / 2,023; ANLI (R1): 10,000 / 1,000; GSM8K: 7,473 / 1,319; SVAMP: 800 / 300.
Hardware Specification	Yes	We primarily conduct our experiments on numerous nodes with 8 RTX 3090 24G, with equivalent hardware specifications across nodes. For a few compute heavy experiments we use nodes with 8 A100 40G.
Software Dependencies	No	The paper does not explicitly state specific version numbers for key software components such as programming languages, libraries, or frameworks. It mentions 'Adam' as the optimizer and 'bf16' for model precision, but these are types rather than specific versioned software dependencies.
Experiment Setup	Yes	Table 3: Hyperparameters across datasets (ARC-C, CQA, CLadder 1.5, ANLI, GSM8K, SVAMP). Identical for all six datasets: batch size 8; learning rate 10^-5; weight decay 0.01; warm up steps 100; optimizer Adam; model precision bf16; samples for self consistency 5; inference decoding temperature 1.0; evaluation decoding temperature 0. Rationalization (default): True for ARC-C, CQA, CLadder 1.5, and ANLI; False for GSM8K and SVAMP.
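
To make the quoted Algorithm 1 easier to follow, here is a minimal Python sketch of one outer iteration (lines 5-19). It is an illustration under stated assumptions, not the authors' implementation. The names `adastar_iteration`, `attempt`, and `train` are hypothetical: `attempt(i)` stands in for sampling from the model and checking the answer for observation i, and `train` stands in for the training step that returns the accuracy α. The hierarchical min-heap is approximated with Python's `heapq` over tuples (t_i, w_i, i), so entries are ordered by last-sampled iteration first and win rate second. Rounding mα² down to an integer is also an assumption.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Entry = Tuple[int, float, int]  # (t_i, w_i, i): last-sampled iteration, win rate, index


def adastar_iteration(
    heap: List[Entry],                                # must already satisfy the heap invariant
    t: int,                                           # current iteration, larger than every stored t_i
    beta: int,                                        # target number of correct samples
    K: int,                                           # samples drawn per observation
    attempt: Callable[[int], bool],                   # hypothetical: True if a sampled answer is correct
    train: Callable[[List[Tuple[int, int]]], float],  # hypothetical: returns alpha in [0, 1]
) -> List[Tuple[int, int]]:
    """Run one outer iteration and update `heap` in place."""
    d_plus: List[Tuple[int, int]] = []  # correct samples, stored as (i, k)
    w_tmp: Dict[int, float] = {}
    m = 0
    order = sorted(heap)  # same order in which heappop would return entries

    # Lines 7-14: peek observations in priority order until beta correct samples exist.
    while len(d_plus) < beta and m < len(order):
        _, _, i = order[m]
        m += 1
        w_tmp[i] = 0.0
        for k in range(1, K + 1):
            correct = attempt(i)
            # Line 12: running mean of correctness over the k samples so far.
            w_tmp[i] = (k - 1) / k * w_tmp[i] + (1.0 if correct else 0.0) / k
            if correct:
                d_plus.append((i, k))

    # Lines 15-19: train, then refresh statistics for the first m * alpha^2 peeked entries.
    alpha = train(d_plus)
    for _ in range(int(m * alpha ** 2)):  # assumption: round down
        _, _, i = heapq.heappop(heap)
        heapq.heappush(heap, (t, w_tmp[i], i))
    return d_plus
```

Because only the first mα² peeked observations have their statistics updated, a low training accuracy leaves most of them at the front of the heap, so they are revisited in the next iteration.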