Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

Authors: Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora

ICLR 2025

Reproducibility assessment: each entry below gives the variable, the classification result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. ... Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult."
Researcher Affiliation: Collaboration
LLM Response: "Simran Kaur (1), Simon Park (1), Anirudh Goyal (2), Sanjeev Arora (1). (1) Princeton Language and Intelligence (PLI), Princeton University; (2) Meta."
Pseudocode: No
LLM Response: "The method involves an automated interaction with a frontier LLM (GPT-4-Turbo). We ask the frontier LLM to first generate a list of topics that arise in instruction-following. For each topic returned by the LLM, we further prompt it to generate a list of skills that are needed to answer typical queries on that topic. Additionally, we ask the LLM to create a list of query types (e.g., Information Seeking) that might arise in that topic. See Appendix L.4 for details about the prompts used, and Appendix K.2 for the list of all extracted skills." This process is described in text and diagrams (Figures 1 and 2) but not as structured pseudocode or an algorithm block.
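The extraction stage quoted above can be outlined as a simple loop of LLM calls. The sketch below is illustrative only: the `ask_llm` helper and the prompt wording are placeholders (the paper's actual prompts are in its Appendix L.4), not the authors' implementation.

```python
def extract_seed_metadata(ask_llm, num_topics=10):
    """Sketch of the described extraction stage: ask a frontier LLM for
    topics that arise in instruction-following, then for each topic ask
    for the skills and query types relevant to typical queries on it.

    `ask_llm` is a placeholder callable: prompt string -> list of strings.
    """
    topics = ask_llm(
        f"List {num_topics} topics that arise in instruction-following."
    )
    metadata = {}
    for topic in topics:
        skills = ask_llm(
            f"List the skills needed to answer typical queries on: {topic}"
        )
        query_types = ask_llm(
            f"List query types (e.g., Information Seeking) for: {topic}"
        )
        metadata[topic] = {"skills": skills, "query_types": query_types}
    return metadata
```

With a real backend, `ask_llm` would wrap a call to a frontier model (the paper uses GPT-4-Turbo) plus parsing of the returned list.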
Open Source Code: Yes
LLM Response: "Source code can be found at https://github.com/princeton-pli/Instruct-SkillMix."
Open Datasets: Yes
LLM Response: "ShareGPT (Chiang et al., 2023) contains conversations collected from a model-hosting website, whereas OpenAssistant (Köpf et al., 2023) and Dolly (Conover et al., 2023) contain crowd-sourced human data. Another intriguing method, popularized by Self-Instruct (Wang et al., 2023b) (and its variants, e.g., Alpaca (Taori et al., 2023)), leverages synthetic datasets."
Dataset Splits: Yes
LLM Response: "We train for multiple epochs and select the best checkpoint by performance on 100 held-out questions. ... We randomly choose 100 held-out examples from our dataset."
Hardware Specification: Yes
LLM Response: "Training a 7B model on 15 epochs of 1000 examples from Instruct-SkillMix takes approximately 15 minutes on 4 H100 GPUs via PyTorch FSDP (Zhao et al., 2023)."
Software Dependencies: No
LLM Response: "We use the torchtune package (torchtune maintainers and contributors, 2024) to train all models, except for the Gemma models, which were trained with the MAmmoTH package (Yue et al., 2023)." Specific version numbers for these packages are not provided.
Experiment Setup: Yes
LLM Response: "In Table 12, we include the hyperparameters used in our experiments. We finetune each model using the AdamW optimizer. For every run, we use a learning rate schedule with a linear warmup ratio of 0.03 and cosine decay to zero. For all experiments, we finetune for 15 epochs and store the checkpoint after each epoch, with the exception of the full Alpaca-52K dataset, on which we only finetune for 3 epochs."
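The learning-rate schedule described in that excerpt (linear warmup over the first 3% of steps, then cosine decay to zero) can be written down directly. This is a generic sketch of such a schedule, not code from the authors' repository; the step counts are illustrative.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Learning rate at a given step for the schedule described above:
    linear warmup from 0 to `peak_lr` over the first `warmup_ratio`
    fraction of steps, then cosine decay to zero over the remainder."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay phase: progress goes from 0 to ~1, lr from peak to ~0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this would be handed to an optimizer-agnostic scheduler (e.g., PyTorch's `LambdaLR` with the function divided by `peak_lr`), alongside AdamW as the paper specifies.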