Are Transformers universal approximators of sequence-to-sequence functions?

Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them. (See the architecture sketch after the table.)
Researcher Affiliation | Collaboration | Chulhee Yun (MIT) chulheey@mit.edu; Srinadh Bhojanapalli (Google Research NY) bsrinadh@google.com; Ankit Singh Rawat (Google Research NY) ankitsrawat@google.com; Sashank J. Reddi (Google Research NY) sashank@google.com; Sanjiv Kumar (Google Research NY) sanjivk@google.com
Pseudocode | No | The paper includes mathematical proofs and descriptions of functions but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology described in the paper is openly available.
Open Datasets | Yes | We use English Wikipedia corpus and Books Corpus dataset (Zhu et al., 2015) for this pre-training.
Dataset Splits | Yes | The metrics are reported on the dev sets of these datasets.
Hardware Specification | Yes | Pre-training takes around 2 days on 16 TPUv3 chips.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, TensorFlow, PyTorch, CUDA).
Experiment Setup | Yes | We train it with the Adam optimizer, with 0.01 dropout and weight decay. We do pre-training for 250k steps with a batch size of 1024 and a max sequence length of 512. (See the configuration sketch after the table.)
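
The abstract quoted above hinges on the split between self-attention layers (which mix information across positions and, per the paper, can compute contextual mappings) and position-wise feed-forward layers. The NumPy sketch below only illustrates that decomposition; it is not the authors' code. Single-head attention, the toy layer sizes, the omitted layer normalization, and the random positional encoding `E` are all simplifying assumptions made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: the only part that mixes information
    across positions (the role the paper ties to contextual mappings)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: applied to each token independently."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def transformer_block(X, p):
    """One Transformer block with residual connections (layer norm omitted)."""
    X = X + self_attention(X, p["Wq"], p["Wk"], p["Wv"])
    X = X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"])
    return X

# Toy usage: n=4 tokens, model width d=8 (illustrative sizes only).
rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
params = {
    "Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)), "Wv": rng.normal(size=(d, d)),
    "W1": rng.normal(size=(d, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d)), "b2": np.zeros(d),
}
X = rng.normal(size=(n, d))                    # input sequence
E = rng.normal(size=(n, d))                    # positional encoding breaks permutation equivariance
print(transformer_block(X + E, params).shape)  # -> (4, 8)
```

Without `E`, permuting the rows of `X` permutes the output rows in the same way (permutation equivariance); adding a positional encoding is what lets the paper extend the result to arbitrary continuous sequence-to-sequence functions.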
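The experiment-setup row quotes only a handful of hyperparameters, so a reproduction attempt would have to fill in the rest. The dataclass below is a minimal sketch that collects what the paper states and flags what it does not; the class name, field names, and the `None` placeholders are our own hypothetical choices, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PretrainConfig:
    """Pre-training hyperparameters as quoted in the paper's experiment setup.
    Fields set to None are not reported and must be chosen by a reproducer."""
    optimizer: str = "adam"                 # reported: Adam optimizer
    dropout: float = 0.01                   # reported: 0.01 dropout
    weight_decay: Optional[float] = None    # weight decay is mentioned, but its value is not stated
    train_steps: int = 250_000              # reported: 250k pre-training steps
    batch_size: int = 1024                  # reported
    max_seq_length: int = 512               # reported
    learning_rate: Optional[float] = None   # not reported
    warmup_steps: Optional[int] = None      # not reported

config = PretrainConfig()
print(config)
```

Listing the unreported fields explicitly makes the gap flagged under "Software Dependencies" and "Experiment Setup" concrete: the schedule and learning rate would have to be guessed or taken from standard BERT-style pre-training recipes.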