Approximation Rate of the Transformer Architecture for Sequence Modeling

Authors: Haotian Jiang, Qianxiao Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis reveals that the approximation capacity is governed by a low-rank structure within the pairwise coupling of the target's temporal features. Empirical validation confirms that the findings observed under theoretical settings also hold true in practical applications. We conduct a comparative analysis between the Transformer and RNNs, aiming to identify specific types of temporal structures where one model excels or underperforms compared to the other.
Researcher Affiliation | Collaboration | CNRS@CREATE LTD, 1 Create Way, #08-01 CREATE Tower, Singapore 138602; Department of Mathematics, Institute for Functional Intelligent Materials, National University of Singapore
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the authors are releasing their code or provide a link to a code repository.
Open Datasets | Yes | We next analyze a practical example, focusing on the Vision Transformer (ViT) model with the CIFAR10 dataset. For the WMT2014 English-German dataset, we use the original Transformer model as proposed in [30]. (A hedged data-loading sketch for this setup appears below the table.)
Dataset Splits | Yes | We train enough epochs to ensure the loss does not decrease, so that we can use the training error to estimate the approximation error. For the ViT experiment, we use the ViT-B/16 model. For the WMT2014 English-German dataset, we use the original Transformer model as proposed in [30].
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions training ViT models but gives no hardware specs.
Software Dependencies | No | The paper mentions using "PyTorch default initialization" but does not specify the version of PyTorch or of any other software dependency.
Experiment Setup | Yes | The feed-forward part is constructed using a dense network with a width of 128 and a depth of 3 to ensure it has enough expressiveness. Moreover, we have n = d_v = 32 and m_h, which ranges from 1 to 16, to construct models with different ranks. We use PyTorch default initialization and normal training procedures with the Adam optimizer. We train enough epochs to ensure the loss does not decrease, so that we can use the training error to estimate the approximation error. (A hedged sketch of this setup appears below the table.)
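
The synthetic experiment quoted in the Experiment Setup row can be sketched as follows. This is a hedged reconstruction, not the authors' code: the class and function names (RankedAttention, build_ffn) and the placeholder data are assumptions; only the stated hyperparameters (d_v = 32, rank m_h swept from 1 to 16, feed-forward width 128 and depth 3, PyTorch default initialization, Adam, training until the loss plateaus) come from the paper.

```python
import torch
import torch.nn as nn

def build_ffn(d_in, d_out, width=128, depth=3):
    """Dense feed-forward network with the stated width and depth (illustrative name)."""
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

class RankedAttention(nn.Module):
    """Single attention block whose query/key projections cap the attention rank at m_h."""
    def __init__(self, d_v=32, m_h=4):
        super().__init__()
        self.scale = m_h ** 0.5
        self.W_q = nn.Linear(d_v, m_h, bias=False)   # rank-limiting projections
        self.W_k = nn.Linear(d_v, m_h, bias=False)
        self.W_v = nn.Linear(d_v, d_v, bias=False)
        self.ffn = build_ffn(d_v, d_v)               # width 128, depth 3

    def forward(self, x):                            # x: (batch, seq_len, d_v)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return self.ffn(scores @ v)

# Train until the loss stops decreasing, then read off the training error as an
# estimate of the approximation error, as the quoted setup describes.
model = RankedAttention(d_v=32, m_h=8)               # m_h is swept from 1 to 16 in the paper
opt = torch.optim.Adam(model.parameters())
x, y = torch.randn(64, 10, 32), torch.randn(64, 10, 32)   # placeholder target data (assumption)
for _ in range(1000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("approximation-error estimate:", loss.item())
```

Varying m_h across runs while keeping the feed-forward part fixed is what produces models of different rank in this sketch.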
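
For the practical example quoted in the Open Datasets and Dataset Splits rows (ViT-B/16 on CIFAR10), a minimal data- and model-loading sketch is given below. It assumes torchvision supplies both the dataset and the ViT implementation; the paper does not name the library or release code, so these calls are illustrative rather than the authors' setup.

```python
import torch
from torchvision import datasets, models, transforms

# CIFAR10 images are 32x32, while ViT-B/16 expects 224x224 inputs.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Untrained ViT-B/16 with a 10-class head; per the quoted rows, training runs
# until the loss plateaus and the training error serves as the approximation-error estimate.
model = models.vit_b_16(weights=None, num_classes=10)
```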