Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
Authors: Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from INSTRUCT-SKILLMIX leads to strong gains on instruction following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. ... Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. |
| Researcher Affiliation | Collaboration | Simran Kaur¹, Simon Park¹, Anirudh Goyal², Sanjeev Arora¹ — ¹Princeton Language and Intelligence (PLI), Princeton University; ²Meta |
| Pseudocode | No | The method involves an automated interaction with a frontier LLM (GPT-4-Turbo). We ask the frontier LLM to first generate a list of topics that arise in instruction-following. For each topic returned by the LLM, we further prompt it to generate a list of skills that are needed to answer typical queries on that topic. Additionally, we ask the LLM to create a list of query types (e.g., Information Seeking) that might arise in that topic. See Appendix L.4 for details about the prompts used, and Appendix K.2 for the list of all extracted skills. This process is described in text and diagrams (Figures 1 and 2) but not as structured pseudocode or an algorithm block. |
| Open Source Code | Yes | Source code can be found at https://github.com/princeton-pli/Instruct-SkillMix. |
| Open Datasets | Yes | ShareGPT (Chiang et al., 2023) contains conversations collected from a model-hosting website, whereas OpenAssistant (Köpf et al., 2023) and Dolly (Conover et al., 2023) contain crowd-sourced human data. Another intriguing method, popularized by Self-Instruct (Wang et al., 2023b) (and its variants, e.g., Alpaca (Taori et al., 2023)), leverages synthetic datasets. |
| Dataset Splits | Yes | We train for multiple epochs and select the best checkpoint by performance on 100 held-out questions. ... We randomly choose 100 held-out examples from our dataset. |
| Hardware Specification | Yes | Training a 7B model on 15 epochs of 1000 examples from INSTRUCT-SKILLMIX takes approximately 15 minutes on 4 H100 GPUs via PyTorch FSDP (Zhao et al., 2023). |
| Software Dependencies | No | We use the torchtune package (torchtune maintainers and contributors, 2024) to train all models, except for the Gemma models, which were trained with the MAmmoTH package (Yue et al., 2023). Specific version numbers for these packages are not provided. |
| Experiment Setup | Yes | In Table 12, we include the hyperparameters used in our experiments. We finetune each model using the AdamW optimizer. For every run, we use a learning rate schedule with a linear warmup of 0.03 and cosine decay to zero. For all experiments, we finetune for 15 epochs and store the checkpoint after each epoch, with the exception of the full Alpaca-52K dataset on which we only finetune for 3 epochs. |
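Since the paper describes its extraction stage only in text and diagrams (see the Pseudocode row), the topic → skills → query-types loop can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: `query_llm`, `extract_skill_metadata`, and the `fake_llm` stub are all hypothetical stand-ins for real GPT-4-Turbo API calls.

```python
from typing import Callable

def extract_skill_metadata(query_llm: Callable[[str], list]) -> dict:
    """Sketch of the extraction stage: ask a frontier LLM for topics,
    then for each topic ask for the skills and query types it involves.
    Returns {topic: {"skills": [...], "query_types": [...]}}."""
    catalog = {}
    topics = query_llm("List topics that arise in instruction-following.")
    for topic in topics:
        skills = query_llm(
            f"List skills needed to answer typical queries on: {topic}")
        query_types = query_llm(
            f"List query types (e.g., Information Seeking) for: {topic}")
        catalog[topic] = {"skills": skills, "query_types": query_types}
    return catalog

# Toy stub so the control flow can run without a real API:
def fake_llm(prompt: str) -> list:
    if prompt.startswith("List topics"):
        return ["coding", "travel advice"]
    return ["placeholder item"]

catalog = extract_skill_metadata(fake_llm)
```

In the actual pipeline, `query_llm` would wrap a call to GPT-4-Turbo using the prompts given in the paper's Appendix L.4.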
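The schedule quoted in the Experiment Setup row (linear warmup over a 0.03 fraction of training, then cosine decay to zero) can be sketched as a pure-Python learning-rate multiplier. The 0.03 warmup fraction comes from the quoted setup; the function name and interface are illustrative assumptions, not torchtune's actual scheduler.

```python
import math

def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.03) -> float:
    """Learning-rate multiplier: linear warmup for `warmup_frac` of training,
    then cosine decay from 1.0 down to 0.0 at `total_steps`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return step / warmup_steps  # linear ramp from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero
```

The base learning rate (from the paper's Table 12) would be multiplied by this value at each optimizer step, e.g. via `torch.optim.lr_scheduler.LambdaLR`.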