Representational Strengths and Limitations of Transformers
Authors: Clayton Sanford, Daniel J. Hsu, Matus Telgarsky
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While our investigation is purely approximation-theoretic, we also include in Appendix D a preliminary empirical study, showing that attention can learn qSA with vastly fewer samples than recurrent networks and MLPs; we feel this further emphasizes the fundamental value of qSA, and constitutes an exciting direction for future work. See also Appendix D, Experiment details. |
| Researcher Affiliation | Academia | Clayton Sanford, Daniel Hsu (Department of Computer Science, Columbia University, New York, NY 10027; {clayton,djhsu}@cs.columbia.edu); Matus Telgarsky (Courant Institute, New York University, New York, NY 10012; matus.telgarsky@nyu.edu) |
| Pseudocode | No | The paper includes mathematical definitions, theorems, and proofs, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement or link indicating the release of open-source code for the described methodology. |
| Open Datasets | No | Experiments used synthetic data, generated for qSA with n = 1000 training and testing examples, a sequence length N = 20, q = 3... The paper mentions using synthetic data but does not provide access information or state its public availability. |
| Dataset Splits | No | Experiments used synthetic data, generated for qSA with n = 1000 training and testing examples... The paper mentions training and testing but does not specify a validation split or its proportion. |
| Hardware Specification | Yes | Experiments... take a few minutes to run on an NVIDIA TITAN XP, and would be much faster on standard modern hardware. |
| Software Dependencies | No | Experiments fit the regression loss using Adam and a minibatch size of 32, with default precision... Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state size 800... The paper mentions Adam and a PyTorch LSTM but does not provide specific library version numbers. |
| Experiment Setup | Yes | Experiments fit the regression loss using Adam and a minibatch size of 32, with default precision... The attention is identical to the description in the paper body, with the additional detail of the width and embedding dimension m being fixed to 100... Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state size 800. A minimal illustrative sketch of this setup appears below the table. |
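
Since the paper releases no code, the following is a minimal PyTorch sketch of the quoted setup: synthetic qSA data with sequence length N = 20, q = 3, and n = 1000 training and testing examples, fit to a regression loss with Adam and a minibatch size of 32, using a standard PyTorch LSTM with 2 layers and hidden state size 800. The exact qSA input encoding, the `make_qsa_data` helper, and the epoch count are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sizes quoted in the paper's Appendix D: sequence length N = 20, sparsity q = 3,
# n = 1000 training and testing examples, minibatch size 32.
N, q, n, batch_size = 20, 3, 1000, 32

def make_qsa_data(num_examples):
    """Assumed form of a qSA (q-sparse averaging) instance: each position i
    carries a scalar value and a q-subset of positions; the target at i is
    the mean of the values at that subset."""
    values = torch.randn(num_examples, N)
    # Random q-subsets of [N] for every position of every example: (num_examples, N, q).
    idx = torch.stack([torch.stack([torch.randperm(N)[:q] for _ in range(N)])
                       for _ in range(num_examples)])
    targets = values.gather(1, idx.reshape(num_examples, -1)).reshape(num_examples, N, q).mean(-1)
    # Encode each position as its scalar value concatenated with a multi-hot mask of its subset.
    mask = torch.zeros(num_examples, N, N)
    mask.scatter_(2, idx, 1.0)
    inputs = torch.cat([values.unsqueeze(-1), mask], dim=-1)  # (num_examples, N, N + 1)
    return inputs, targets

train_x, train_y = make_qsa_data(n)
test_x, test_y = make_qsa_data(n)

class LSTMRegressor(nn.Module):
    """Standard PyTorch LSTM with 2 layers and hidden state size 800, as quoted for Figure 6."""
    def __init__(self, input_dim, hidden_size=800, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)              # (batch, N, hidden_size)
        return self.head(out).squeeze(-1)  # (batch, N): one prediction per position

model = LSTMRegressor(input_dim=N + 1)
opt = torch.optim.Adam(model.parameters())  # default Adam hyperparameters
loss_fn = nn.MSELoss()                      # regression loss

for epoch in range(20):                     # epoch count is an assumption
    perm = torch.randperm(n)
    for start in range(0, n, batch_size):
        batch = perm[start:start + batch_size]
        opt.zero_grad()
        loss = loss_fn(model(train_x[batch]), train_y[batch])
        loss.backward()
        opt.step()
    with torch.no_grad():
        print(f"epoch {epoch}: test MSE = {loss_fn(model(test_x), test_y).item():.4f}")
```

The attention model the paper compares against (width and embedding dimension m = 100) is omitted here; a self-attention module could be swapped in for the LSTM under the same data generation and training loop.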