HyperPrompt: Prompt-based Task-Conditioning of Transformers
Authors: Yun He, Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, Yaguang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, Ed H. Chi
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performances over strong T5 multi-task learning baselines and parameter-efficient adapter variants including Prompt-Tuning and HyperFormer++ on Natural Language Understanding benchmarks of GLUE and SuperGLUE across many model sizes. |
| Researcher Affiliation | Collaboration | Yun He*1, Huaixiu Steven Zheng*2, Yi Tay2, Jai Gupta2, Yu Du2, Vamsi Aribandi2, Zhe Zhao2, YaGuang Li2, Zhao Chen3, Donald Metzler2, Heng-Tze Cheng2, Ed H. Chi2. *Equal contribution. 1Texas A&M University, work done as an intern at Google; 2Google Research; 3Waymo LLC. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper mentions using 'Mesh TensorFlow (Shazeer et al., 2018)' and the 'T5 library (Raffel et al., 2019)', with footnotes linking to their respective GitHub repositories. However, it does not provide a statement or link releasing the code for the HyperPrompt method described in this paper. |
| Open Datasets | Yes | Datasets. We evaluate the performance of the models on GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) respectively. |
| Dataset Splits | Yes | We save a checkpoint every 2000 steps for all models and follow the same convention as Raffel et al. (2019) in selecting the best checkpoint for each task. ...To calculate the attention mass over hyper-prompts per layer, we averaged the hyper-prompt attention softmax scores across 100 validation examples and each attention head in a layer... |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup. |
| Software Dependencies | No | The paper mentions 'Mesh TensorFlow (Shazeer et al., 2018)', the 'T5 library (Raffel et al., 2019)', and the 'Adam optimizer (Kingma & Ba, 2014)', but no version numbers for these software components are provided in the text. |
| Experiment Setup | Yes | For all experiments, we train models 300K steps with a batch size of 128 and each batch is a mixture which samples each task proportionately to the number of examples in the dataset. Learning rate is a constant of 1e-3 with Adam optimizer (Kingma & Ba, 2014). For hyper-parameter tuning, the length of prompt l is selected from {12, 16, 20, 24} at the encoder and {2, 4, 6, 8, 10, 12, 14, 16} at the decoder. The bottleneck dimension b in the transform matrices is set to d/r, where d is the model dimension of the T5 models and r is a reduction factor selected from {16, 32, 64}. The dimension t of the layer-aware task embedding is selected from {32, 64, 128}. |
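
The experiment-setup row quotes the key hyper-parameters (prompt length l, bottleneck dimension b = d/r, layer-aware task-embedding dimension t). Below is a minimal NumPy sketch of how a hypernetwork bottleneck of this shape could map a layer-aware task embedding to l task-conditioned prompt vectors of model dimension d. It is not the authors' released code: the concrete sizes, variable names, initialization scale, and the weight-generation scheme are assumptions inferred only from the quoted hyper-parameters.

```python
# Hypothetical sketch of hyper-prompt generation (not the authors' code).
# Sizes follow the quoted search ranges: b = d / r, t from {32, 64, 128},
# l from the encoder range {12, 16, 20, 24}.
import numpy as np

d = 512          # assumed T5 model dimension
r = 32           # reduction factor, from {16, 32, 64}
b = d // r       # bottleneck dimension of the transform matrices
t = 64           # layer-aware task-embedding dimension, from {32, 64, 128}
l = 16           # encoder prompt length, from {12, 16, 20, 24}

rng = np.random.default_rng(0)

# Global prompts shared across tasks (illustrative initialization).
global_prompts = rng.normal(scale=0.02, size=(l, d))

# Layer-aware task embedding for one (task, layer) pair.
task_embedding = rng.normal(scale=0.02, size=(t,))

# Hypothetical hypernetwork: generate down/up projection weights from the
# task embedding, then pass the global prompts through the bottleneck.
W_gen_down = rng.normal(scale=0.02, size=(t, d * b))
W_gen_up = rng.normal(scale=0.02, size=(t, b * d))
W_down = (task_embedding @ W_gen_down).reshape(d, b)   # d -> b
W_up = (task_embedding @ W_gen_up).reshape(b, d)       # b -> d

# Task-conditioned hyper-prompts of shape (l, d).
hyper_prompts = np.maximum(global_prompts @ W_down, 0.0) @ W_up
print(hyper_prompts.shape)  # (16, 512)
```

In the paper, the resulting hyper-prompts are prepended to the keys and values of self-attention in each Transformer block; the sketch stops at producing the l x d prompt block and leaves that injection step out.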