Are Transformers universal approximators of sequence-to-sequence functions?

Authors: Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them. (See the architecture sketch after the table.)
Researcher Affiliation | Collaboration | Chulhee Yun (MIT) chulheey@mit.edu; Srinadh Bhojanapalli (Google Research NY) bsrinadh@google.com; Ankit Singh Rawat (Google Research NY) ankitsrawat@google.com; Sashank J. Reddi (Google Research NY) sashank@google.com; Sanjiv Kumar (Google Research NY) sanjivk@google.com
Pseudocode | No | The paper includes mathematical proofs and descriptions of functions but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology described in the paper is openly available.
Open Datasets | Yes | We use English Wikipedia corpus and Books Corpus dataset (Zhu et al., 2015) for this pre-training.
Dataset Splits | Yes | The metrics are reported on the dev sets of these datasets.
Hardware Specification | Yes | Pre-training takes around 2 days on 16 TPUv3 chips.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, TensorFlow, PyTorch, CUDA).
Experiment Setup | Yes | We train it with the Adam optimizer, with 0.01 dropout and weight decay. We do pre-training for 250k steps with a batch size of 1024 and a max sequence length of 512. (See the configuration sketch after the table.)
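
The abstract quoted above hinges on the split between self-attention layers (which mix information across positions and, per the paper, can compute contextual mappings) and position-wise feed-forward layers. The NumPy sketch below only illustrates that decomposition; it is not the authors' code. Single-head attention, the toy layer sizes, the omitted layer normalization, and the random positional encoding `E` are all simplifying assumptions made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: the only part that mixes information
    across positions (the role the paper ties to contextual mappings)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: applied to each token independently."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def transformer_block(X, p):
    """One Transformer block with residual connections (layer norm omitted)."""
    X = X + self_attention(X, p["Wq"], p["Wk"], p["Wv"])
    X = X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"])
    return X

# Toy usage: n=4 tokens, model width d=8 (illustrative sizes only).
rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
params = {
    "Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)), "Wv": rng.normal(size=(d, d)),
    "W1": rng.normal(size=(d, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d)), "b2": np.zeros(d),
}
X = rng.normal(size=(n, d))                    # input sequence
E = rng.normal(size=(n, d))                    # positional encoding breaks permutation equivariance
print(transformer_block(X + E, params).shape)  # -> (4, 8)
```

Without `E`, permuting the rows of `X` permutes the output rows in the same way (permutation equivariance); adding a positional encoding is what lets the paper extend the result to arbitrary continuous sequence-to-sequence functions.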
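The experiment-setup row quotes only a handful of hyperparameters, so a reproduction attempt would have to fill in the rest. The dataclass below is a minimal sketch that collects what the paper states and flags what it does not; the class name, field names, and the `None` placeholders are our own hypothetical choices, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PretrainConfig:
    """Pre-training hyperparameters as quoted in the paper's experiment setup.
    Fields set to None are not reported and must be chosen by a reproducer."""
    optimizer: str = "adam"                 # reported: Adam optimizer
    dropout: float = 0.01                   # reported: 0.01 dropout
    weight_decay: Optional[float] = None    # weight decay is mentioned, but its value is not stated
    train_steps: int = 250_000              # reported: 250k pre-training steps
    batch_size: int = 1024                  # reported
    max_seq_length: int = 512               # reported
    learning_rate: Optional[float] = None   # not reported
    warmup_steps: Optional[int] = None      # not reported

config = PretrainConfig()
print(config)
```

Listing the unreported fields explicitly makes the gap flagged under "Software Dependencies" and "Experiment Setup" concrete: the schedule and learning rate would have to be guessed or taken from standard BERT-style pre-training recipes.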