Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Approximation Rate of the Transformer Architecture for Sequence Modeling
Authors: Haotian Jiang, Qianxiao Li
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis reveals that the approximation capacity is governed by a low-rank structure within the pairwise coupling of the target's temporal features. Empirical validation confirms that the findings observed under theoretical settings also hold in practical applications. We also conduct a comparative analysis between the Transformer and RNNs, aiming to identify specific types of temporal structures where one model excels or underperforms relative to the other. |
| Researcher Affiliation | Collaboration | CNRS@CREATE LTD, 1 Create Way, #08-01 CREATE Tower, Singapore 138602 Department of Mathematics, Institute for Functional Intelligent Materials, National University of Singapore |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their code or provide a link to a code repository. |
| Open Datasets | Yes | We next analyze a practical example, focusing on the Vision Transformer (ViT) model with the CIFAR10 dataset. For the WMT2014 English-German dataset, we use the original Transformer model as proposed in [30]. |
| Dataset Splits | Yes | We train enough epochs to ensure the loss does not decrease so that we can use the training error to estimate the approximation error. For the ViT experiment, we use the ViT-B16 model. For the WMT2014 English-German dataset, we use the original Transformer model as proposed in [30]. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions training ViT models but no hardware specs. |
| Software Dependencies | No | The paper mentions using "PyTorch default initialization" but does not specify the version number of PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | The feed-forward part is constructed using a dense network with a width of 128 and a depth of 3 to ensure it has enough expressiveness. Moreover, we have n = d_v = 32 and m_h, which ranges from 1 to 16, to construct models with different ranks. We use PyTorch default initialization and normal training procedures with the Adam optimizer. We train enough epochs to ensure the loss does not decrease so that we can use the training error to estimate the approximation error. |
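The Experiment Setup row above describes a rank-controlled attention construction: the head dimension m_h (swept from 1 to 16, with n = d_v = 32) bounds the rank of the pairwise coupling between temporal features. A minimal NumPy sketch of that idea, assuming a standard single-head dot-product attention layout (the names `low_rank_attention`, `W_q`, `W_k`, `W_v` are hypothetical, not from the paper):

```python
import numpy as np

def low_rank_attention(x, W_q, W_k, W_v):
    """Single attention head on a sequence x of shape (seq, n).
    The pairwise coupling matrix W_q @ W_k.T has rank at most m_h,
    the head dimension (hypothetical reconstruction of the setup)."""
    scores = (x @ W_q) @ (x @ W_k).T              # (seq, seq), rank <= m_h
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ (x @ W_v)                       # (seq, d_v)

rng = np.random.default_rng(0)
n, d_v, m_h, seq = 32, 32, 8, 10                  # paper sweeps m_h from 1 to 16
x = rng.standard_normal((seq, n))
W_q = rng.standard_normal((n, m_h))
W_k = rng.standard_normal((n, m_h))
W_v = rng.standard_normal((n, d_v))
out = low_rank_attention(x, W_q, W_k, W_v)

# The coupling rank is capped by m_h regardless of n.
assert np.linalg.matrix_rank(W_q @ W_k.T) <= m_h
print(out.shape)  # (10, 32)
```

In the paper's experiments this attention output would feed a dense network of width 128 and depth 3, trained with Adam until the loss plateaus; the sketch above only illustrates how m_h controls the rank of the coupling.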