Representational Strengths and Limitations of Transformers

Authors: Clayton Sanford, Daniel J. Hsu, Matus Telgarsky

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "While our investigation is purely approximation-theoretic, we also include in Appendix D a preliminary empirical study, showing that attention can learn qSA with vastly fewer samples than recurrent networks and MLPs; we feel this further emphasizes the fundamental value of qSA, and constitutes an exciting direction for future work." See also Appendix D, Experiment details.
Researcher Affiliation | Academia | Clayton Sanford, Daniel Hsu, Department of Computer Science, Columbia University, New York, NY 10027, {clayton,djhsu}@cs.columbia.edu; Matus Telgarsky, Courant Institute, New York University, New York, NY 10012, matus.telgarsky@nyu.edu
Pseudocode | No | The paper includes mathematical definitions, theorems, and proofs, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement or link indicating the release of open-source code for the described methodology.
Open Datasets | No | "Experiments used synthetic data, generated for qSA with n = 1000 training and testing examples, a sequence length N = 20, q = 3, with the individual inputs described in more detail as follows..." The paper mentions using synthetic data but does not provide access information or state its public availability. A data-generation sketch under these parameters appears after the table.
Dataset Splits | No | "Experiments used synthetic data, generated for qSA with n = 1000 training and testing examples..." The paper mentions training and testing sets but does not specify a validation split or its proportion.
Hardware Specification | Yes | "Experiments... take a few minutes to run on an NVIDIA TITAN XP, and would be much faster on standard modern hardware."
Software Dependencies | No | "Experiments fit the regression loss using Adam and a minibatch size of 32, with default precision... Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state size 800..." The paper names software such as Adam and the PyTorch LSTM implementation but does not provide specific version numbers.
Experiment Setup | Yes | "Experiments fit the regression loss using Adam and a minibatch size of 32, with default precision... The attention is identical to the description in the paper body, with the additional detail of the width and embedding dimension m being fixed to 100... Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state size 800." A sketch of this setup appears after the table.
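
For concreteness, the synthetic qSA data described in the Open Datasets and Dataset Splits rows (n = 1000 training and testing examples, sequence length N = 20, q = 3) can be sketched as below. The value dimension d, the flat encoding that concatenates each token's value vector with its q source indices, and the reading of qSA as q-sparse averaging (each position's target is the average of the q referenced value vectors) are illustrative assumptions; Appendix D of the paper gives the exact construction.

```python
# Hedged sketch of synthetic qSA (q-sparse averaging) data generation.
# Hyperparameters (N = 20, q = 3, n = 1000 train/test examples) follow the
# excerpt above; the value dimension d and the flat token encoding
# (value vector concatenated with the q source indices) are assumptions.
import torch

def make_qsa_split(n=1000, N=20, q=3, d=8, seed=0):
    g = torch.Generator().manual_seed(seed)
    # One d-dimensional value vector per token.
    values = torch.randn(n, N, d, generator=g)
    # For each token, q distinct source indices chosen uniformly from [N].
    idx = torch.rand(n, N, N, generator=g).argsort(dim=-1)[..., :q]  # (n, N, q)
    # Target at position i: the average of the q referenced value vectors.
    batch = torch.arange(n).view(n, 1, 1)
    targets = values[batch, idx].mean(dim=2)                          # (n, N, d)
    # Illustrative input encoding: values concatenated with the indices.
    inputs = torch.cat([values, idx.float()], dim=-1)                 # (n, N, d + q)
    return inputs, targets

train_x, train_y = make_qsa_split(seed=0)
test_x, test_y = make_qsa_split(seed=1)
```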
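
Similarly, the Experiment Setup row (Adam, minibatch size 32, regression loss, width and embedding dimension m = 100, plus a standard 2-layer PyTorch LSTM with hidden size 800 as a baseline) suggests roughly the training loop below. The single self-attention layer, the default Adam learning rate, and the epoch count are assumptions made for illustration; the excerpt does not fix those details.

```python
# Hedged sketch of the training setup from the "Experiment Setup" row:
# Adam, minibatch size 32, squared-error regression loss, width m = 100,
# plus the 2-layer, hidden-size-800 PyTorch LSTM mentioned as a baseline.
# The one-layer attention architecture, learning rate, and epoch count
# are illustrative assumptions.
import torch
import torch.nn as nn

d_in, d_out, m = 8 + 3, 8, 100  # matches the data sketch above: d + q inputs

class AttentionRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(d_in, m)
        self.attn = nn.MultiheadAttention(embed_dim=m, num_heads=1, batch_first=True)
        self.out = nn.Linear(m, d_out)

    def forward(self, x):
        h = self.embed(x)
        h, _ = self.attn(h, h, h)  # self-attention over the N tokens
        return self.out(h)

# Baseline from the excerpt: a standard PyTorch LSTM, 2 layers, hidden size 800
# (its outputs would need an extra linear projection to d_out, omitted here).
lstm_baseline = nn.LSTM(input_size=d_in, hidden_size=800, num_layers=2, batch_first=True)

model = AttentionRegressor()
opt = torch.optim.Adam(model.parameters())  # default learning rate (assumption)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(train_x, train_y), batch_size=32, shuffle=True)

for epoch in range(10):  # epoch count is an assumption
    for xb, yb in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()
```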