Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Solve Complex Problems via Dataset Decomposition

Authors: Wanru Zhao, Lucas Page-Caccia, Zhengyan Shi, Minseon Kim, Weijia Xu, Alessandro Sordoni

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on math datasets (MATH and AIME) and code generation datasets demonstrate that models trained with curricula generated by our approach exhibit superior performance compared to standard training on original datasets. (Section 4, Experiments and Results)
Researcher Affiliation	Collaboration	Wanru Zhao (a, b), Lucas Caccia (b), Zhengyan Shi (b), Minseon Kim (b), Weijia Xu (b), Alessandro Sordoni (b, c); (a) University of Cambridge, (b) Microsoft Research, (c) Mila Quebec AI Institute
Pseudocode	Yes	Algorithm 1 Recursive Dataset Decomposition
Open Source Code	No	We will release the code when the paper gets accepted. Our implementation is based on open-source training and evaluation scripts and is reproducible.
Open Datasets	Yes	Our setup uses MATH [Hendrycks et al., 2021] and the American Invitational Mathematics Examination (AIME). MATH [Hendrycks et al., 2021] is a benchmark of competition math problems of varying difficulty. We evaluate on the same 500 samples in the prior work [Lightman et al., 2023]. AIME contains challenging mathematical competition problems. For training, we use AIME 24 as training set and AIME 25 as test set. Both datasets contain 30 problems that were used in the AIME in 2024 and 2025, respectively. More details can be found in Appendix A. ... Specifically, we use the CodeForces-CoTs dataset [Penedo et al., 2025] (competitive programming solutions in C++) from Hugging Face Open-R1 [Hugging Face, 2025] for training. ... For evaluation, we employ the HumanEval benchmark [Chen et al., 2021] (Python function completion), which serves as an explicitly out-of-distribution scenario.
Dataset Splits	Yes	For training, we use AIME 24 as training set and AIME 25 as test set. Both datasets contain 30 problems that were used in the AIME in 2024 and 2025, respectively. More details can be found in Appendix A. ... We partitioned the decomposed AIME2024 dataset into five equal-sized bins (quintiles) based on our proposed difficulty measurement, shown in Table 6. ... We evaluate on the same 500 samples in the prior work [Lightman et al., 2023].
Hardware Specification	Yes	Our experiments are conducted on NVIDIA A100 GPUs with 80GB VRAM.
Software Dependencies	No	We train the models using bfloat16 precision with a learning rate of 10⁻⁵, warmed up linearly for 5% and then decayed to 0 over the rest of the training, following a cosine schedule. We use the AdamW optimizer [Loshchilov and Hutter, 2019].
Experiment Setup	Yes	We train each model for 5 epochs with a batch size of 16. We train the models using bfloat16 precision with a learning rate of 10⁻⁵, warmed up linearly for 5% and then decayed to 0 over the rest of the training, following a cosine schedule. We use the AdamW optimizer [Loshchilov and Hutter, 2019]. Unless otherwise specified, we evaluate with a temperature of 0 (greedy decoding) and measure accuracy (equivalent to pass@1). The results are averaged over three different training seeds.
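
The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 5% of training, then cosine decay to 0) can be illustrated with a minimal sketch. This is not the authors' code, which had not been released at the time of classification. The function name `lr_at_step`, the per-step granularity, and the convention that warmup reaches the peak on its last step are assumptions made for illustration; the peak value of 1e-5 is read from the quoted setup.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_frac: float = 0.05) -> float:
    """Hypothetical helper: linear warmup to peak_lr, then cosine decay to 0.

    Assumes the schedule is updated once per optimizer step; the paper's
    quoted text does not say whether warmup is counted in steps or epochs.
    """
    # Number of warmup steps (at least 1 to avoid division by zero).
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp; reaches peak_lr on the last warmup step (assumed convention).
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from peak_lr down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the paper
    for s in (0, 49, 50, 525, 1000):
        print(s, lr_at_step(s, total))
```

With 1000 total steps, the first 50 steps form the warmup and the remaining 950 follow the cosine curve. Training libraries commonly ship an equivalent scheduler, so in practice a hand-written function like this would mainly serve to check the expected curve.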