Prompting a Pretrained Transformer Can Be a Universal Approximator
Authors: Aleksandar Petrov, Philip Torr, Adel Bibi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Despite the widespread adoption of prompting, prompt tuning and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of a pretrained model by prompting or prefix-tuning it. Formally, whether prompting and prefix-tuning a pretrained model can universally approximate sequence-to-sequence functions. This paper answers in the affirmative and demonstrates that much smaller pretrained models than previously thought can be universal approximators when prefixed. In fact, prefix-tuning a single attention head is sufficient to approximate any continuous function, making the attention mechanism uniquely suited for universal approximation. Moreover, any sequence-to-sequence function can be approximated by prefixing a transformer with depth linear in the sequence length. Beyond these density-type results, we also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision. (An illustrative prefix-tuned attention head is sketched after this table.) |
| Researcher Affiliation | Academia | Aleksandar Petrov, Philip H.S. Torr, Adel Bibi (Department of Engineering Science, University of Oxford, UK). |
| Pseudocode | No | The paper presents theoretical results and proofs (e.g., Theorem 1, Lemma 1) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | This is a theoretical paper focused on mathematical proofs and universal approximation; therefore, it does not provide open-source code for a specific methodology or implementation. |
| Open Datasets | No | This is a theoretical paper that mathematically defines concept classes and universal approximation properties, rather than conducting experiments on datasets. Therefore, no dataset is used or made publicly available for training. |
| Dataset Splits | No | This paper is theoretical and does not involve empirical experiments or dataset splits for training, validation, or testing. |
| Hardware Specification | No | This is a theoretical paper that does not report on empirical experiments; therefore, no hardware specifications are mentioned. |
| Software Dependencies | No | This is a theoretical paper that does not report on empirical experiments; therefore, no software dependencies with version numbers are mentioned. |
| Experiment Setup | No | This is a theoretical paper that does not describe any empirical experiments or their setup, including hyperparameters or training configurations. |
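
The mechanism the paper analyzes is prefix-tuning: the transformer's weights stay frozen and only extra key/value positions prepended to the attention computation are trained. The sketch below is a minimal illustration of that mechanism only, not the paper's explicit universal-approximation construction; all names (`attention_head`, `W_q`, `W_k`, `W_v`, `prefix_k`, `prefix_v`) and the random shapes are assumptions made for the example.

```python
# Minimal sketch of prefix-tuning a single frozen attention head (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v, prefix_k=None, prefix_v=None):
    """Single attention head; an optional prefix extends the keys/values only."""
    Q = X @ W_q                      # queries come from the input tokens
    K = X @ W_k
    V = X @ W_v
    if prefix_k is not None:         # prefix-tuning: weights stay frozen,
        K = np.concatenate([prefix_k, K], axis=0)  # trainable KV pairs are prepended
        V = np.concatenate([prefix_v, V], axis=0)
    A = softmax(Q @ K.T / np.sqrt(W_k.shape[1]))
    return A @ V                     # each output is a convex combination of values

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                        # 5 input tokens (hypothetical)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # frozen head weights

out_plain = attention_head(X, W_q, W_k, W_v)
prefix_k = rng.normal(size=(3, d))                 # 3 prefix positions (would be trained)
prefix_v = rng.normal(size=(3, d))
out_prefixed = attention_head(X, W_q, W_k, W_v, prefix_k, prefix_v)

# The prefix changes the head's output without modifying W_q, W_k, W_v;
# steering this output via the prefix alone is the setting whose expressive
# power the paper studies.
print(np.abs(out_plain - out_prefixed).max())
```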