Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

Authors: Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora

ICLR 2025

Reproducibility assessment: each entry below gives the variable, the classification result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. ... Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult."
Researcher Affiliation: Collaboration
LLM Response: "Simran Kaur (1), Simon Park (1), Anirudh Goyal (2), Sanjeev Arora (1). (1) Princeton Language and Intelligence (PLI), Princeton University; (2) Meta."
Pseudocode: No
LLM Response: "The method involves an automated interaction with a frontier LLM (GPT-4-Turbo). We ask the frontier LLM to first generate a list of topics that arise in instruction-following. For each topic returned by the LLM, we further prompt it to generate a list of skills that are needed to answer typical queries on that topic. Additionally, we ask the LLM to create a list of query types (e.g., Information Seeking) that might arise in that topic. See Appendix L.4 for details about the prompts used, and Appendix K.2 for the list of all extracted skills." This process is described in text and diagrams (Figures 1 and 2) but not as structured pseudocode or an algorithm block.
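The extraction stage quoted above can be outlined as a simple loop of LLM calls. The sketch below is illustrative only: the `ask_llm` helper and the prompt wording are placeholders (the paper's actual prompts are in its Appendix L.4), not the authors' implementation.

```python
def extract_seed_metadata(ask_llm, num_topics=10):
    """Sketch of the described extraction stage: ask a frontier LLM for
    topics that arise in instruction-following, then for each topic ask
    for the skills and query types relevant to typical queries on it.

    `ask_llm` is a placeholder callable: prompt string -> list of strings.
    """
    topics = ask_llm(
        f"List {num_topics} topics that arise in instruction-following."
    )
    metadata = {}
    for topic in topics:
        skills = ask_llm(
            f"List the skills needed to answer typical queries on: {topic}"
        )
        query_types = ask_llm(
            f"List query types (e.g., Information Seeking) for: {topic}"
        )
        metadata[topic] = {"skills": skills, "query_types": query_types}
    return metadata
```

With a real backend, `ask_llm` would wrap a call to a frontier model (the paper uses GPT-4-Turbo) plus parsing of the returned list.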
Open Source Code: Yes
LLM Response: "Source code can be found at https://github.com/princeton-pli/Instruct-SkillMix."
Open Datasets: Yes
LLM Response: "ShareGPT (Chiang et al., 2023) contains conversations collected from a model-hosting website, whereas OpenAssistant (Köpf et al., 2023) and Dolly (Conover et al., 2023) contain crowd-sourced human data. Another intriguing method, popularized by Self-Instruct (Wang et al., 2023b) (and its variants, e.g., Alpaca (Taori et al., 2023)), leverages synthetic datasets."
Dataset Splits: Yes
LLM Response: "We train for multiple epochs and select the best checkpoint by performance on 100 held-out questions. ... We randomly choose 100 held-out examples from our dataset."
Hardware Specification: Yes
LLM Response: "Training a 7B model on 15 epochs of 1000 examples from Instruct-SkillMix takes approximately 15 minutes on 4 H100 GPUs via PyTorch FSDP (Zhao et al., 2023)."
Software Dependencies: No
LLM Response: "We use the torchtune package (torchtune maintainers and contributors, 2024) to train all models, except for the Gemma models, which were trained with the MAmmoTH package (Yue et al., 2023)." Specific version numbers for these packages are not provided.
Experiment Setup: Yes
LLM Response: "In Table 12, we include the hyperparameters used in our experiments. We finetune each model using the AdamW optimizer. For every run, we use a learning rate schedule with a linear warmup ratio of 0.03 and cosine decay to zero. For all experiments, we finetune for 15 epochs and store the checkpoint after each epoch, with the exception of the full Alpaca-52K dataset, on which we only finetune for 3 epochs."
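The learning-rate schedule described in that excerpt (linear warmup over the first 3% of steps, then cosine decay to zero) can be written down directly. This is a generic sketch of such a schedule, not code from the authors' repository; the step counts are illustrative.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Learning rate at a given step for the schedule described above:
    linear warmup from 0 to `peak_lr` over the first `warmup_ratio`
    fraction of steps, then cosine decay to zero over the remainder."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay phase: progress goes from 0 to ~1, lr from peak to ~0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this would be handed to an optimizer-agnostic scheduler (e.g., PyTorch's `LambdaLR` with the function divided by `peak_lr`), alongside AdamW as the paper specifies.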