Representational Strengths and Limitations of Transformers

Authors: Clayton Sanford, Daniel J. Hsu, Matus Telgarsky

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "While our investigation is purely approximation-theoretic, we also include in Appendix D a preliminary empirical study, showing that attention can learn qSA with vastly fewer samples than recurrent networks and MLPs; we feel this further emphasizes the fundamental value of qSA, and constitutes an exciting direction for future work." See also Appendix D, Experiment details.
Researcher Affiliation | Academia | Clayton Sanford, Daniel Hsu, Department of Computer Science, Columbia University, New York, NY 10027, {clayton,djhsu}@cs.columbia.edu; Matus Telgarsky, Courant Institute, New York University, New York, NY 10012, matus.telgarsky@nyu.edu
Pseudocode | No | The paper includes mathematical definitions, theorems, and proofs, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement or link indicating the release of open-source code for the described methodology.
Open Datasets | No | "Experiments used synthetic data, generated for qSA with n = 1000 training and testing examples, a sequence length N = 20, q = 3, with the individual inputs described in more detail as follows..." The paper mentions using synthetic data but does not provide access information or state its public availability. A data-generation sketch under these parameters appears after the table.
Dataset Splits | No | "Experiments used synthetic data, generated for qSA with n = 1000 training and testing examples..." The paper mentions training and testing sets but does not specify a validation split or its proportion.
Hardware Specification | Yes | "Experiments... take a few minutes to run on an NVIDIA TITAN XP, and would be much faster on standard modern hardware."
Software Dependencies | No | "Experiments fit the regression loss using Adam and a minibatch size of 32, with default precision... Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state size 800..." The paper names software such as Adam and the PyTorch LSTM implementation but does not provide specific version numbers.
Experiment Setup | Yes | "Experiments fit the regression loss using Adam and a minibatch size of 32, with default precision... The attention is identical to the description in the paper body, with the additional detail of the width and embedding dimension m being fixed to 100... Figure 6 also contains an LSTM, which is a standard pytorch LSTM with 2 layers and a hidden state size 800." A sketch of this setup appears after the table.
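
For concreteness, the synthetic qSA data described in the Open Datasets and Dataset Splits rows (n = 1000 training and testing examples, sequence length N = 20, q = 3) can be sketched as below. The value dimension d, the flat encoding that concatenates each token's value vector with its q source indices, and the reading of qSA as q-sparse averaging (each position's target is the average of the q referenced value vectors) are illustrative assumptions; Appendix D of the paper gives the exact construction.

```python
# Hedged sketch of synthetic qSA (q-sparse averaging) data generation.
# Hyperparameters (N = 20, q = 3, n = 1000 train/test examples) follow the
# excerpt above; the value dimension d and the flat token encoding
# (value vector concatenated with the q source indices) are assumptions.
import torch

def make_qsa_split(n=1000, N=20, q=3, d=8, seed=0):
    g = torch.Generator().manual_seed(seed)
    # One d-dimensional value vector per token.
    values = torch.randn(n, N, d, generator=g)
    # For each token, q distinct source indices chosen uniformly from [N].
    idx = torch.rand(n, N, N, generator=g).argsort(dim=-1)[..., :q]  # (n, N, q)
    # Target at position i: the average of the q referenced value vectors.
    batch = torch.arange(n).view(n, 1, 1)
    targets = values[batch, idx].mean(dim=2)                          # (n, N, d)
    # Illustrative input encoding: values concatenated with the indices.
    inputs = torch.cat([values, idx.float()], dim=-1)                 # (n, N, d + q)
    return inputs, targets

train_x, train_y = make_qsa_split(seed=0)
test_x, test_y = make_qsa_split(seed=1)
```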
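
Similarly, the Experiment Setup row (Adam, minibatch size 32, regression loss, width and embedding dimension m = 100, plus a standard 2-layer PyTorch LSTM with hidden size 800 as a baseline) suggests roughly the training loop below. The single self-attention layer, the default Adam learning rate, and the epoch count are assumptions made for illustration; the excerpt does not fix those details.

```python
# Hedged sketch of the training setup from the "Experiment Setup" row:
# Adam, minibatch size 32, squared-error regression loss, width m = 100,
# plus the 2-layer, hidden-size-800 PyTorch LSTM mentioned as a baseline.
# The one-layer attention architecture, learning rate, and epoch count
# are illustrative assumptions.
import torch
import torch.nn as nn

d_in, d_out, m = 8 + 3, 8, 100  # matches the data sketch above: d + q inputs

class AttentionRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(d_in, m)
        self.attn = nn.MultiheadAttention(embed_dim=m, num_heads=1, batch_first=True)
        self.out = nn.Linear(m, d_out)

    def forward(self, x):
        h = self.embed(x)
        h, _ = self.attn(h, h, h)  # self-attention over the N tokens
        return self.out(h)

# Baseline from the excerpt: a standard PyTorch LSTM, 2 layers, hidden size 800
# (its outputs would need an extra linear projection to d_out, omitted here).
lstm_baseline = nn.LSTM(input_size=d_in, hidden_size=800, num_layers=2, batch_first=True)

model = AttentionRegressor()
opt = torch.optim.Adam(model.parameters())  # default learning rate (assumption)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(train_x, train_y), batch_size=32, shuffle=True)

for epoch in range(10):  # epoch count is an assumption
    for xb, yb in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()
```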