HyperTuning: Toward Adapting Large Language Models without Back-propagation

Authors: Jason Phang, Yi Mao, Pengcheng He, Weizhu Chen

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate HyperT5 on P3, MetaICL, and Super-NaturalInstructions datasets, and show that it can effectively generate parameters for unseen tasks.
Researcher Affiliation | Collaboration | Center for Data Science, New York University, NY, USA; EleutherAI; Microsoft Azure AI, WA, USA. Correspondence to: Jason Phang <jasonphang@nyu.edu>.
Pseudocode | Yes | Additional architectural details and pseudo-code for both HyperT5-Prefix and HyperT5-LoRA models can be found in Appendix C.
Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology.
Open Datasets | Yes | To demonstrate the generality of our approach, we conduct experiments on three different multi-task training datasets, each with different held-out tasks and evaluation protocols: Public Pool of Prompts (P3) (Sanh et al., 2022) [...] MetaICL (Min et al., 2022) [...] Super-NaturalInstructions (S-NI) (Wang et al., 2022).
Dataset Splits | No | The paper mentions 'held-out tasks' for evaluation and 'dev' in table headers, which imply a validation/development set, but it gives no explicit percentages or counts for training/validation/test splits, nor does it state the size of the 'dev' set or whether it serves as a validation set for hyperparameter tuning separate from the test set.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., specific GPU or CPU models, memory, or cloud instances).
Software Dependencies | No | The paper mentions software such as '8-bit Adam', 'ZeRO', and 'Transformers' but does not provide specific version numbers for these components.
Experiment Setup | Yes | All experiments are trained with 8-bit Adam (Dettmers et al., 2022), a batch size of 256, a learning rate of 5e-5, and a linear decay schedule. Training was performed with ZeRO (Rajbhandari et al., 2020) and Transformers (Wolf et al., 2020). For hypermodels, the hypermodel's max input sequence length is 1024 tokens and the downstream model's max input sequence length is 384 tokens. [...] The max target sequence length is set to 128 for all experiments.
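
For concreteness, the hyperparameters quoted in the Experiment Setup row could be expressed roughly as below with Hugging Face Transformers' TrainingArguments. This is a hedged sketch, not the authors' configuration (no code is released): the output directory, the per-device batch split, the device count, and the DeepSpeed config file name are assumptions; only the values themselves (effective batch size 256, learning rate 5e-5, linear decay, 8-bit Adam, ZeRO, and the sequence-length limits) come from the quoted setup.

```python
# Hypothetical sketch of the quoted training setup using Hugging Face
# Transformers' TrainingArguments. Only the hyperparameter values are taken
# from the paper's quoted setup; everything else is an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hypert5_run",        # hypothetical output directory
    per_device_train_batch_size=8,   # assumed split: 8 per device x 4 accumulation
    gradient_accumulation_steps=4,   #   steps x 8 devices = effective batch size 256
    learning_rate=5e-5,              # learning rate from the quoted setup
    lr_scheduler_type="linear",      # linear decay schedule
    optim="adamw_bnb_8bit",          # 8-bit Adam (Dettmers et al., 2022)
    # deepspeed="zero_config.json",  # ZeRO (Rajbhandari et al., 2020) would be
                                     #   enabled via a DeepSpeed config file
)

# Sequence-length limits from the quoted setup, enforced at tokenization time:
HYPERMODEL_MAX_INPUT_LEN = 1024  # hypermodel max input length (tokens)
DOWNSTREAM_MAX_INPUT_LEN = 384   # downstream model max input length (tokens)
MAX_TARGET_LEN = 128             # max target length for all experiments
```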