A Closer Look at the Limitations of Instruction Tuning

Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, through rigorous experiments and an in-depth analysis of the changes LLMs undergo through IT, we reveal various limitations of IT.
Researcher Affiliation | Collaboration | 1 University of Maryland, College Park, USA; 2 Adobe, USA; 3 NVIDIA, India.
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | All LLMs evaluated in this paper were trained using LLaMA-Factory (Zheng et al., 2024b).
Open Datasets | Yes | For fine-tuning with IT, we experiment with various synthetic and human-written IT datasets. For synthetic, we use Alpaca 52k with open-domain instruction-response pairs, constructed by prompting ChatGPT with an initial seed dataset with few samples (Taori et al., 2023), and MedInstruct 52k from the medical domain, constructed in similar fashion (Zhang et al., 2023b). For human-written, we use LIMA 1K (Zhou et al., 2023) and databricks-dolly 15k (Conover et al., 2023). Finally, we also use Tulu-V2-Mix 326k (Ivison et al., 2023), which is an amalgamation of various open-source datasets.
Dataset Splits | No | All models are trained in a distributed manner for 3 epochs, with a learning rate of 5e-5 and an effective batch size of 32 (Taori et al., 2023).
Hardware Specification | No | We employ LLaMA-2 70B only in a fraction of experiments owing to compute constraints.
Software Dependencies | No | All LLMs evaluated in this paper were trained using LLaMA-Factory (Zheng et al., 2024b).
Experiment Setup | Yes | All models are trained in a distributed manner for 3 epochs, with a learning rate of 5e-5 and an effective batch size of 32 (Taori et al., 2023). For LFT, we use a standard rank of 8 (Hu et al., 2021), as we did not find a substantial change in performance by decreasing (2, 4) or increasing (16, 32) it. Zhang et al. (2024) also show that scaling rank is ineffective for LFT. For generation, we employ greedy decoding (i.e., zero temperature) in all experiments.
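The Open Datasets row above names the five IT corpora used in the paper. The sketch below shows one way to pull them from the Hugging Face Hub; the Hub repository IDs are assumptions (the paper does not list them), LIMA is gated behind a license acceptance, and MedInstruct 52k may need to be fetched from its own release rather than the Hub.

```python
# Minimal sketch: loading the instruction-tuning datasets named in the paper.
# Repository IDs are assumptions, not taken from the paper.
from datasets import load_dataset

it_datasets = {
    "alpaca_52k": "tatsu-lab/alpaca",                    # synthetic, open-domain (Taori et al., 2023)
    "dolly_15k": "databricks/databricks-dolly-15k",      # human-written (Conover et al., 2023)
    "lima_1k": "GAIR/lima",                              # human-written, gated (Zhou et al., 2023)
    "tulu_v2_mix_326k": "allenai/tulu-v2-sft-mixture",   # open-source mixture (Ivison et al., 2023)
}

loaded = {name: load_dataset(repo, split="train") for name, repo in it_datasets.items()}
for name, ds in loaded.items():
    print(name, len(ds), ds.column_names)
```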
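The Experiment Setup row reports the training recipe: 3 epochs, learning rate 5e-5, effective batch size 32, LoRA rank 8 for LFT, and greedy decoding for generation. The authors trained with LLaMA-Factory; the sketch below approximates that recipe with Hugging Face Transformers and PEFT instead. The base checkpoint, LoRA target modules, alpha/dropout, and the per-device/accumulation split are assumptions; only the rank, epochs, learning rate, effective batch size, and greedy decoding come from the paper.

```python
# Minimal sketch of the reported fine-tuning setup, using PEFT/Transformers as a
# stand-in for LLaMA-Factory (which the authors actually used).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; gated on the Hub
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# LoRA fine-tuning (LFT) with the rank-8 adapters reported in the paper.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# 3 epochs, lr 5e-5; 4 samples/device * 8 accumulation steps = effective batch size 32
# on one GPU (the paper trains distributed, so its per-device split likely differs).
args = TrainingArguments(output_dir="lft-sketch", num_train_epochs=3, learning_rate=5e-5,
                         per_device_train_batch_size=4, gradient_accumulation_steps=8,
                         bf16=True, logging_steps=10)

# Greedy (zero-temperature) decoding, as used for all generations in the paper;
# the prompt and max_new_tokens are illustrative.
prompt = "Summarize the limitations of instruction tuning in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The `TrainingArguments` object would then be passed to a `Trainer` together with a tokenized instruction-response dataset; that wiring is omitted here to keep the sketch focused on the hyperparameters the paper reports.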