HyperTuning: Toward Adapting Large Language Models without Back-propagation
Authors: Jason Phang, Yi Mao, Pengcheng He, Weizhu Chen
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HyperT5 on P3, MetaICL, and Super-NaturalInstructions datasets, and show that it can effectively generate parameters for unseen tasks. |
| Researcher Affiliation | Collaboration | 1 Center for Data Science, New York University, NY, USA; 2 EleutherAI; 3 Microsoft Azure AI, WA, USA. Correspondence to: Jason Phang <jasonphang@nyu.edu>. |
| Pseudocode | Yes | Additional architectural details and pseudo-code for both HyperT5-Prefix and HyperT5-LoRA models can be found in Appendix C. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology. |
| Open Datasets | Yes | To demonstrate the generality of our approach, we conduct experiments on three different multi-task training datasets, each with different held-out tasks and evaluation protocols. Public Pool of Prompts (P3) (Sanh et al., 2022) [...] MetaICL (Min et al., 2022) [...] Super-NaturalInstructions (S-NI) (Wang et al., 2022) |
| Dataset Splits | No | The paper mentions 'held-out tasks' for evaluation and 'dev' in table headers, which imply a validation/development set, but it does not give specific percentages or counts for training/validation/test splits, nor does it explicitly define the 'dev' set's size or its role as a validation set for hyperparameter tuning separate from the test set. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud instances). |
| Software Dependencies | No | The paper mentions software like '8-bit Adam', 'ZeRO', and 'Transformers' but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | All experiments are trained with 8-bit Adam (Dettmers et al., 2022), a batch size of 256, a learning rate of 5e-5, and a linear decay schedule. Training was performed with ZeRO (Rajbhandari et al., 2020) and Transformers (Wolf et al., 2020). For hypermodels, the hypermodel's max input sequence length is 1024 tokens and the downstream model's max input sequence length is 384 tokens. [...] The max target sequence length is set to 128 for all experiments. |
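
The "Experiment Setup" row above quotes the reported hyperparameters directly. Below is a minimal sketch of how those values might map onto a generic PyTorch/Transformers training setup, assuming 8-bit Adam from the bitsandbytes library and a linear decay schedule via `get_linear_schedule_with_warmup`; the placeholder model, the total-step count, and all variable names are illustrative assumptions, not the authors' code.

```python
# Hedged sketch: the hyperparameters quoted in the "Experiment Setup" row,
# wired into a generic PyTorch/Transformers skeleton. The model here is a
# stand-in; the paper's HyperT5 hypermodel architecture is not reproduced.
import torch
import bitsandbytes as bnb  # assumed dependency for 8-bit Adam (Dettmers et al., 2022)
from transformers import get_linear_schedule_with_warmup

# Values reported in the paper's setup description.
BATCH_SIZE = 256
LEARNING_RATE = 5e-5
HYPERMODEL_MAX_INPUT_LEN = 1024   # hypermodel max input sequence length (tokens)
DOWNSTREAM_MAX_INPUT_LEN = 384    # downstream model max input sequence length (tokens)
MAX_TARGET_LEN = 128              # max target sequence length for all experiments
TOTAL_STEPS = 10_000              # illustrative only; not stated in the quoted excerpt

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the hypermodel

# 8-bit Adam optimizer with the reported learning rate.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=LEARNING_RATE)

# Linear decay schedule (no warmup is specified in the quoted excerpt).
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=TOTAL_STEPS
)
```

The ZeRO (Rajbhandari et al., 2020) partitioning mentioned in the quote would typically be handled by a distributed launcher such as DeepSpeed, and the effective batch size of 256 by gradient accumulation across devices; both layers are omitted from this sketch.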