A Closer Look at the Limitations of Instruction Tuning
Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, through rigorous experiments and an in-depth analysis of the changes LLMs undergo through IT, we reveal various limitations of IT. |
| Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA; NVIDIA, India. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | All LLMs evaluated in this paper were trained using LLaMA-Factory (Zheng et al., 2024b). |
| Open Datasets | Yes | For fine-tuning with IT, we experiment with various synthetic and human-written IT datasets. For synthetic, we use Alpaca 52k with open-domain instruction-response pairs, constructed by prompting ChatGPT with an initial seed dataset with few samples (Taori et al., 2023), and MedInstruct 52k from the medical domain, constructed in similar fashion (Zhang et al., 2023b). For human-written, we use LIMA 1K (Zhou et al., 2023) and databricks-dolly 15k (Conover et al., 2023). Finally, we also use Tulu-V2-Mix 326k (Ivison et al., 2023), which is an amalgamation of various open-source datasets. (See the dataset-loading sketch below the table.) |
| Dataset Splits | No | All models are trained in a distributed manner for 3 epochs, with a learning rate of 5e-5 and an effective batch size of 32 (Taori et al., 2023). |
| Hardware Specification | No | We employ LLaMA-2 70B only in a fraction of experiments owing to compute constraints. |
| Software Dependencies | No | All LLMs evaluated in this paper were trained using LLaMA-Factory (Zheng et al., 2024b). |
| Experiment Setup | Yes | All models are trained in a distributed manner for 3 epochs, with a learning rate of 5e-5 and an effective batch size of 32 (Taori et al., 2023). For LFT, we use a standard rank of 8 (Hu et al., 2021) as we did not find a substantial change in performance by decreasing (2, 4) or increasing it (16, 32). Zhang et al. (2024) also show that scaling rank is ineffective for LFT. For generation, we employ greedy decoding (i.e., zero temperature) in all experiments. (See the configuration sketch below the table.) |
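
The instruction-tuning datasets quoted above are all publicly released, so a minimal loading sketch is given below using the Hugging Face `datasets` library. The Hub identifiers are our assumptions (the paper cites the datasets but gives no download locations); MedInstruct 52k is omitted because we are not confident of a canonical Hub identifier for it, and LIMA is gated and requires accepting its license on the Hub.

```python
# Minimal sketch: loading the IT datasets named in the paper from the Hugging Face Hub.
# The Hub IDs below are assumptions; they are not stated in the paper.
from datasets import load_dataset

dataset_ids = {
    "alpaca_52k": "tatsu-lab/alpaca",                  # synthetic, open-domain (Taori et al., 2023)
    "dolly_15k": "databricks/databricks-dolly-15k",    # human-written (Conover et al., 2023)
    "lima_1k": "GAIR/lima",                            # human-written (Zhou et al., 2023); gated access
    "tulu_v2_mix_326k": "allenai/tulu-v2-sft-mixture", # mixture of open-source IT sets (Ivison et al., 2023)
    # MedInstruct 52k (Zhang et al., 2023b) is omitted: we are unsure of a canonical Hub ID.
}

it_datasets = {name: load_dataset(hub_id, split="train") for name, hub_id in dataset_ids.items()}
for name, ds in it_datasets.items():
    print(f"{name}: {len(ds)} examples")
```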
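
To make the quoted setup concrete, the sketch below encodes the reported hyperparameters (3 epochs, learning rate 5e-5, effective batch size 32, LoRA rank 8, greedy decoding) with Hugging Face `transformers` and `peft`. This is not the authors' LLaMA-Factory configuration: the base-model checkpoint, the LoRA alpha and target modules, and the per-device/accumulation split of the effective batch size are all assumptions on our part.

```python
# Sketch of the reported settings; a stand-in for the paper's LLaMA-Factory runs.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

training_args = TrainingArguments(
    output_dir="it-limitations-repro",
    num_train_epochs=3,             # 3 epochs, as reported
    learning_rate=5e-5,             # learning rate 5e-5, as reported
    per_device_train_batch_size=4,  # 4 x 8 accumulation = effective batch size 32;
    gradient_accumulation_steps=8,  # the exact split across devices is an assumption
)

lora_config = LoraConfig(
    r=8,            # standard LoRA rank of 8, as reported (Hu et al., 2021)
    lora_alpha=16,  # assumption: alpha is not stated in the quoted setup
    task_type="CAUSAL_LM",
)

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: one of the LLaMA-2 base checkpoints (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = get_peft_model(AutoModelForCausalLM.from_pretrained(model_name), lora_config)
# training_args would be passed to a trainer (e.g., trl's SFTTrainer) together with one of the IT datasets.

# Greedy decoding ("zero temperature") for all generations, as reported.
prompt = "Below is an instruction. Write a response.\n\nInstruction: List three limitations of instruction tuning.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

These values only mirror the numbers quoted in the table; anything beyond them (prompt template, precision, optimizer, scheduler) is unspecified here and was handled by LLaMA-Factory in the original runs.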