Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs
Authors: Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, Ligong Han, Luke Inglis, Akash Srivastava
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small-sized LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including hyperparameter recommendations from TULU (Wang et al., 2023b) and phased training recommended by Orca (Mitra et al., 2023). The code used for the experiments can be found here: https://github.com/instructlab/training. |
| Researcher Affiliation | Collaboration | Aldo Pareja1,2, Nikhil Shivakumar Nayak1,2, Hao Wang1,2, Krishnateja Killamsetty3, Shivchander Sudalairaj1,2, Wenlong Zhao1,5, Seungwook Han1,4, Abhishek Bhandwaldar1,2, Guangxuan Xu1,2, Kai Xu1,2, Ligong Han1,2, Luke Inglis2,3, Akash Srivastava1,2 — 1Red Hat AI Innovation, 2MIT-IBM Watson AI Lab, 3IBM Research, 4Massachusetts Institute of Technology, 5University of Massachusetts Amherst |
| Pseudocode | No | The paper describes methods and strategies but does not include any explicitly labeled pseudocode or algorithm blocks. It refers to 'detailed documentation of these configurations' and details on 'training infrastructure and optimization techniques' in an appendix, but these do not present pseudocode. |
| Open Source Code | Yes | The code used for the experiments can be found here: https://github.com/instructlab/training. |
| Open Datasets | Yes | We fine-tuned them on five datasets: an instruction-following dataset with 308,343 samples, a foundational knowledge dataset with 231,178 samples, a complex skills dataset with 285,966 samples, the TULU mixture v2 dataset, and a domain-specific math, reasoning and coding dataset. ... We also conducted experiments with the TULU dataset (Wang et al., 2023b; Ivison et al., 2023), a diverse mix of complex, instruction-tuning data from human and GPT-4 sources... We curated a comprehensive dataset designed to progressively enhance the base model's capabilities in instruction following (phase 00), foundational knowledge (phase 05), and complex skills (phase 10) (see Sudalairaj et al., 2024, for details). |
| Dataset Splits | No | The paper describes the datasets used for fine-tuning and mentions evaluation on various benchmarks (MMLU, MTBench, etc.) which typically have predefined test sets. It also discusses dataset partitioning based on difficulty for training strategies (e.g., 'Phase I: The bottom 50% of the data containing short sentences. Phase II: The top 50% of the data containing long sentences'), but it does not provide explicit training/validation/test splits for its own curated datasets or how they were used beyond the training phases. |
| Hardware Specification | No | fine-tuning LLMs poses unique challenges, as it often requires state-of-the-art clusters spanning multiple machines, each equipped with multiple GPUs, and advanced networking to optimize speed, memory efficiency, and scalability using frameworks such as Deepspeed (Rasley et al., 2020), PyTorch's FSDP (Zhao et al., 2023) or Megatron-LM (Narayanan et al., 2021). ... For instance, on 64 GPUs, we can process a batch of 3,840 samples in a single micro-batch, whereas on 1 GPU or 8 GPUs, we use gradient accumulation to approximate the same batch size. |
| Software Dependencies | No | Across all experiments, we use the Adam optimizer with β1 = 0.9 and β2 = 0.95. ... we used the Adam optimizer ... This approach allowed us to simulate very large batch sizes, to investigate their impact on model performance. ... We implement a variant of Multipack distributed sampler (Multipack Sampler, 2024)... Our variant extends the original design by accounting for padding, crucial for non-linear attention mechanisms like scaled dot-product attention (Vaswani et al., 2017), and clustering together samples of similar length. It ensures that even with padding, no GPU exceeds a pre-determined token capacity, which we calculate to maintain an expected micro-batch size that satisfies: ... This approach balances computational load across GPUs, resulting in improved training throughput and stability. Additionally, our sampler supports both linear attention mechanisms, such as Flash Attention (Dao et al., 2022), and traditional non-linear attention, making it versatile for various model architectures. |
| Experiment Setup | Yes | Table 1: Summary of hyperparameter configurations — Effective Batch Size: 128 samples (TULU), same as TULU (TULU++), 3,840 or 7,680 samples (LAB); Learning Rate Warmup: ratio 0.03 (TULU), same as TULU (TULU++), ratio 0.01 with 25 steps linear warmup (LAB); Scheduler: linear decay until the end of training (TULU), no decay, constant rate after warmup (LAB); Number of Epochs: 3 (TULU), 4 (TULU++), 10 (LAB); Goal Learning Rate: 2×10⁻⁵ (TULU), 3×10⁻⁵ (TULU++), 2×10⁻⁵, also tested with higher rates (LAB). ... We investigate effective batch sizes of 128 (small), 3,840 (medium), and 7,680 (large) samples. ... We experiment with various goal learning rates: 1×10⁻⁶, 5×10⁻⁶, 2×10⁻⁵, 3×10⁻⁵, 4×10⁻⁵, 6×10⁻⁵, 8×10⁻⁵, and 1×10⁻⁴. Warmup steps are varied among 0, 25, and 100, corresponding to different numbers of samples processed before reaching the goal learning rate. ... Across all experiments, we use the Adam optimizer with β1 = 0.9 and β2 = 0.95. |
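The Hardware Specification quote notes that on 64 GPUs a 3,840-sample batch fits in a single micro-batch, while on 1 or 8 GPUs gradient accumulation approximates the same effective batch size. The arithmetic behind that statement can be sketched as follows; the function name and the 60-samples-per-GPU micro-batch are illustrative assumptions, not values stated in the paper.

```python
def grad_accum_steps(effective_batch: int, micro_batch_per_gpu: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed so that
    micro_batch_per_gpu * num_gpus * steps == effective_batch."""
    per_step = micro_batch_per_gpu * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be divisible by micro_batch_per_gpu * num_gpus")
    return effective_batch // per_step

# Assuming 60 samples per GPU per micro-batch:
# 64 GPUs process the 3,840-sample batch in one step,
print(grad_accum_steps(3840, 60, 64))  # 1
# while 8 GPUs need 8 accumulation steps to approximate it.
print(grad_accum_steps(3840, 60, 8))   # 8
```

The same relation covers the TULU configuration: a 128-sample effective batch on 8 GPUs with a 16-sample per-GPU micro-batch needs no accumulation at all.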
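The Software Dependencies quote describes a Multipack-sampler variant that accounts for padding, clusters samples of similar length, and guarantees no GPU exceeds a pre-determined token capacity. The paper's actual implementation lives in the linked repository; the first-fit-decreasing sketch below only illustrates the padding-aware capacity constraint (bin footprint = number of samples × longest sample in the bin), and all names are hypothetical.

```python
def pack_with_padding(lengths: list[int], capacity: int) -> list[dict]:
    """Greedy sketch of padding-aware packing: place samples (longest first,
    which naturally clusters similar lengths) into bins so that the padded
    footprint len(bin) * max_len never exceeds the token capacity."""
    bins: list[dict] = []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        if lengths[idx] > capacity:
            raise ValueError("sample longer than token capacity")
        for b in bins:
            new_max = max(b["max_len"], lengths[idx])
            if (len(b["items"]) + 1) * new_max <= capacity:
                b["items"].append(idx)
                b["max_len"] = new_max
                break
        else:  # no existing bin fits under the padded-footprint constraint
            bins.append({"items": [idx], "max_len": lengths[idx]})
    return bins

# Two long samples share one bin (padded footprint 2 * 100 = 200);
# the short samples go to a second bin rather than pad up to length 100.
bins = pack_with_padding([100, 90, 10, 10], capacity=200)
print(len(bins))  # 2
```

Sorting descending before packing is a common bin-packing heuristic; the paper's variant additionally balances load across GPUs, which this sketch omits.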
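The Experiment Setup quote contrasts two learning-rate schedules: TULU's linear warmup followed by linear decay to the end of training, and LAB's 25-step linear warmup followed by a constant rate. A minimal per-step sketch of both, using the paper's 2×10⁻⁵ goal rate and 0.03 warmup ratio (function names are illustrative):

```python
def lab_lr(step: int, goal_lr: float = 2e-5, warmup_steps: int = 25) -> float:
    """LAB-style schedule: 25-step linear warmup, then constant (no decay)."""
    if step < warmup_steps:
        return goal_lr * (step + 1) / warmup_steps
    return goal_lr

def tulu_lr(step: int, total_steps: int,
            goal_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """TULU-style schedule: warmup for warmup_ratio of training,
    then linear decay toward zero at the final step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return goal_lr * (step + 1) / warmup_steps
    return goal_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lab_lr(24))          # 2e-05 (warmup complete, constant thereafter)
print(lab_lr(10_000))      # 2e-05
print(tulu_lr(999, 1000))  # near zero at the end of training
```

The key difference the paper highlights is the tail behavior: LAB holds the goal rate for all of its 10 epochs, whereas TULU's rate shrinks linearly over its 3 epochs.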