Prompting a Pretrained Transformer Can Be a Universal Approximator

Authors: Aleksandar Petrov, Philip Torr, Adel Bibi

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable: Research Type
Result: Theoretical
LLM Response: Despite the widespread adoption of prompting, prompt tuning, and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of a pretrained model by prompting or prefix-tuning it: formally, whether prompting and prefix-tuning a pretrained model can universally approximate sequence-to-sequence functions. This paper answers in the affirmative and demonstrates that much smaller pretrained models than previously thought can be universal approximators when prefixed. In fact, prefix-tuning a single attention head is sufficient to approximate any continuous function, making the attention mechanism uniquely suited for universal approximation. Moreover, any sequence-to-sequence function can be approximated by prefixing a transformer with depth linear in the sequence length. Beyond these density-type results, we also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
(An illustrative sketch of prefix-tuning a single attention head appears below, after the reproducibility entries.)
Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: Aleksandar Petrov (1), Philip H. S. Torr (1), Adel Bibi (1); (1) Department of Engineering Science, University of Oxford, UK.

Reproducibility Variable: Pseudocode
Result: No
LLM Response: The paper presents theoretical results and proofs (e.g., Theorem 1, Lemma 1) but does not include any pseudocode or algorithm blocks.

Reproducibility Variable: Open Source Code
Result: No
LLM Response: This is a theoretical paper focused on mathematical proofs and universal approximation; therefore, it does not provide open-source code for a specific methodology or implementation.

Reproducibility Variable: Open Datasets
Result: No
LLM Response: This is a theoretical paper that mathematically defines concept classes and universal approximation properties rather than conducting experiments on datasets. Therefore, no dataset is used or made publicly available for training.

Reproducibility Variable: Dataset Splits
Result: No
LLM Response: This paper is theoretical and does not involve empirical experiments or dataset splits for training, validation, or testing.

Reproducibility Variable: Hardware Specification
Result: No
LLM Response: This is a theoretical paper that does not report on empirical experiments; therefore, no hardware specifications are mentioned.

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: This is a theoretical paper that does not report on empirical experiments; therefore, no software dependencies with version numbers are mentioned.

Reproducibility Variable: Experiment Setup
Result: No
LLM Response: This is a theoretical paper that does not describe any empirical experiments or their setup, including hyperparameters or training configurations.
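The Research Type response above summarizes the paper's central claim that prefix-tuning a single attention head already suffices for universal approximation. As a reading aid, the following is a minimal NumPy sketch of what prefix-tuning a single attention head involves mechanically: trainable prefix key/value vectors are concatenated to a frozen head's keys and values. The function name, shapes, and toy usage are illustrative assumptions for exposition, not the paper's construction or notation.

```python
# Minimal sketch (not the paper's construction): prefix-tuning a single
# attention head. Trainable prefix keys/values (P_K, P_V) are prepended to
# the head's keys and values while the pretrained projections W_Q, W_K, W_V
# stay frozen. All names and shapes here are illustrative assumptions.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention_head(X, W_Q, W_K, W_V, P_K, P_V):
    """Single attention head whose keys/values are extended by a prefix.

    X   : (n, d)   input token embeddings
    W_* : (d, d_h) frozen pretrained projections
    P_K : (m, d_h) trainable prefix keys
    P_V : (m, d_h) trainable prefix values
    """
    Q = X @ W_Q                                  # (n, d_h)
    K = np.concatenate([P_K, X @ W_K], axis=0)   # (m + n, d_h)
    V = np.concatenate([P_V, X @ W_V], axis=0)   # (m + n, d_h)
    scores = Q @ K.T / np.sqrt(K.shape[1])       # (n, m + n)
    return softmax(scores) @ V                   # (n, d_h)

# Toy usage with random frozen weights and a random prefix.
rng = np.random.default_rng(0)
n, m, d, d_h = 4, 8, 16, 16
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_h)) for _ in range(3))
P_K, P_V = rng.normal(size=(m, d_h)), rng.normal(size=(m, d_h))
out = prefix_attention_head(X, W_Q, W_K, W_V, P_K, P_V)
print(out.shape)  # (4, 16)
```

In a prefix-tuning setup of this kind, only P_K and P_V would be optimized while the pretrained projections remain fixed, which is the frozen-model regime the paper's approximation results concern.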