Approximation Rate of the Transformer Architecture for Sequence Modeling
Authors: Haotian Jiang, Qianxiao Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis reveals that the approximation capacity is governed by a low-rank structure within the pairwise coupling of the target's temporal features. Empirical validation confirms that the findings observed under theoretical settings also hold in practical applications. We also conduct a comparative analysis between the Transformer and RNNs, aiming to identify specific types of temporal structures where one model excels or underperforms relative to the other. |
| Researcher Affiliation | Collaboration | CNRS@CREATE LTD, 1 Create Way, #08-01 CREATE Tower, Singapore 138602 Department of Mathematics, Institute for Functional Intelligent Materials, National University of Singapore |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their code or provide a link to a code repository. |
| Open Datasets | Yes | We next analyze a practical example, focusing on the Vision Transformer (ViT) model with the CIFAR10 dataset. For the WMT2014 English-German dataset, we use the original Transformer model as proposed in [30]. |
| Dataset Splits | Yes | We train for enough epochs that the loss no longer decreases, so the training error can be used to estimate the approximation error. For the ViT experiment, we use the ViT-B16 model. For the WMT2014 English-German dataset, we use the original Transformer model as proposed in [30]. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions training ViT models but no hardware specs. |
| Software Dependencies | No | The paper mentions using "PyTorch default initialization" but does not specify the version number of PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | The feed-forward part is constructed using a dense network with a width of 128 and a depth of 3 to ensure it has enough expressiveness. Moreover, we have n = d_v = 32 and m_h, which ranges from 1 to 16, to construct models with different ranks. We use PyTorch default initialization and normal training procedures with the Adam optimizer. We train for enough epochs that the loss no longer decreases, so the training error can be used to estimate the approximation error. |
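To make the reported feed-forward configuration concrete, below is a minimal stdlib sketch that tallies the parameter count of such a dense network. It assumes "depth 3" means three linear layers (input 32 → 128 → 128 → 32, matching d_v = 32 and width 128); the helper name `mlp_param_count` is ours, not the authors', and the layer-counting convention is an assumption, not something the paper states.

```python
def mlp_param_count(d_in: int, width: int, depth: int, d_out: int) -> int:
    """Count weights + biases of a dense network with `depth` linear layers.

    Assumed layer sizes: d_in -> width -> ... -> width -> d_out,
    i.e. depth-2 hidden-to-hidden layers between the input and output maps.
    """
    if depth < 2:
        # Single linear map, no hidden width involved.
        return d_in * d_out + d_out
    sizes = [d_in] + [width] * (depth - 1) + [d_out]
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


# Configuration quoted from the paper: width 128, depth 3, d_v = 32.
print(mlp_param_count(32, 128, 3, 32))  # → 24864
```

Under these assumptions the feed-forward block contributes roughly 25k parameters, which is a useful sanity check when reproducing the rank-sweep experiments (m_h from 1 to 16): the feed-forward capacity stays fixed while only the attention rank varies.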