A Closer Look at the Limitations of Instruction Tuning

Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, through rigorous experiments and an in-depth analysis of the changes LLMs undergo through IT, we reveal various limitations of IT.
Researcher Affiliation | Collaboration | 1 University of Maryland, College Park, USA; 2 Adobe, USA; 3 NVIDIA, India.
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | All LLMs evaluated in this paper were trained using LLaMA-Factory (Zheng et al., 2024b).
Open Datasets | Yes | For fine-tuning with IT, we experiment with various synthetic and human-written IT datasets. For synthetic, we use Alpaca 52k with open-domain instruction-response pairs, constructed by prompting ChatGPT with an initial seed dataset with few samples (Taori et al., 2023), and MedInstruct 52k from the medical domain, constructed in similar fashion (Zhang et al., 2023b). For human-written, we use LIMA 1K (Zhou et al., 2023) and databricks-dolly 15k (Conover et al., 2023). Finally, we also use Tulu-V2-Mix 326k (Ivison et al., 2023), which is an amalgamation of various open-source datasets.
Dataset Splits | No | All models are trained in a distributed manner for 3 epochs, with a learning rate of 5e-5 and an effective batch size of 32 (Taori et al., 2023).
Hardware Specification | No | We employ LLaMA-2 70B only in a fraction of experiments owing to compute constraints.
Software Dependencies | No | All LLMs evaluated in this paper were trained using LLaMA-Factory (Zheng et al., 2024b).
Experiment Setup | Yes | All models are trained in a distributed manner for 3 epochs, with a learning rate of 5e-5 and an effective batch size of 32 (Taori et al., 2023). For LFT, we use a standard rank of 8 (Hu et al., 2021), as we did not find a substantial change in performance by decreasing (2, 4) or increasing (16, 32) it. Zhang et al. (2024) also show that scaling rank is ineffective for LFT. For generation, we employ greedy decoding (i.e., zero temperature) in all experiments.
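The Open Datasets row above names the five IT corpora used in the paper. The sketch below shows one way to pull them from the Hugging Face Hub; the Hub repository IDs are assumptions (the paper does not list them), LIMA is gated behind a license acceptance, and MedInstruct 52k may need to be fetched from its own release rather than the Hub.

```python
# Minimal sketch: loading the instruction-tuning datasets named in the paper.
# Repository IDs are assumptions, not taken from the paper.
from datasets import load_dataset

it_datasets = {
    "alpaca_52k": "tatsu-lab/alpaca",                    # synthetic, open-domain (Taori et al., 2023)
    "dolly_15k": "databricks/databricks-dolly-15k",      # human-written (Conover et al., 2023)
    "lima_1k": "GAIR/lima",                              # human-written, gated (Zhou et al., 2023)
    "tulu_v2_mix_326k": "allenai/tulu-v2-sft-mixture",   # open-source mixture (Ivison et al., 2023)
}

loaded = {name: load_dataset(repo, split="train") for name, repo in it_datasets.items()}
for name, ds in loaded.items():
    print(name, len(ds), ds.column_names)
```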
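The Experiment Setup row reports the training recipe: 3 epochs, learning rate 5e-5, effective batch size 32, LoRA rank 8 for LFT, and greedy decoding for generation. The authors trained with LLaMA-Factory; the sketch below approximates that recipe with Hugging Face Transformers and PEFT instead. The base checkpoint, LoRA target modules, alpha/dropout, and the per-device/accumulation split are assumptions; only the rank, epochs, learning rate, effective batch size, and greedy decoding come from the paper.

```python
# Minimal sketch of the reported fine-tuning setup, using PEFT/Transformers as a
# stand-in for LLaMA-Factory (which the authors actually used).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; gated on the Hub
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# LoRA fine-tuning (LFT) with the rank-8 adapters reported in the paper.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# 3 epochs, lr 5e-5; 4 samples/device * 8 accumulation steps = effective batch size 32
# on one GPU (the paper trains distributed, so its per-device split likely differs).
args = TrainingArguments(output_dir="lft-sketch", num_train_epochs=3, learning_rate=5e-5,
                         per_device_train_batch_size=4, gradient_accumulation_steps=8,
                         bf16=True, logging_steps=10)

# Greedy (zero-temperature) decoding, as used for all generations in the paper;
# the prompt and max_new_tokens are illustrative.
prompt = "Summarize the limitations of instruction tuning in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The `TrainingArguments` object would then be passed to a `Trainer` together with a tokenized instruction-response dataset; that wiring is omitted here to keep the sketch focused on the hyperparameters the paper reports.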