Separations in the Representational Capabilities of Transformers and Recurrent Architectures
Authors: Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also empirically investigated the performance of Transformers and standard recurrent models [26], including recently proposed state-space models [22, 21] on these tasks. The observed behavior is along the lines indicated by our theoretical results. In this section, we investigate the performance of Transformers and recurrent models on tasks such as Index Lookup and recognizing bounded Dyck languages on sequences of small lengths (< 1000). |
| Researcher Affiliation | Collaboration | Satwik Bhattamishra (1), Michael Hahn (2), Phil Blunsom (1,3), Varun Kanade (1); (1) University of Oxford, (2) Saarland University, (3) Cohere |
| Pseudocode | No | The paper describes methods and constructions in text but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Yes, we have included the code in the supplementary material. |
| Open Datasets | No | The paper describes custom data generation methods (e.g., 'To create each example, we first sample a length k uniformly...'), implying the datasets are generated on the fly rather than sourced from existing public repositories with access links or citations (a hypothetical generator along these lines is sketched after the table). |
| Dataset Splits | No | The paper mentions 'Validation Accuracy (%)' in figures and states that 'The models are evaluated on 5000 examples for each task', but it does not explicitly specify the size or percentage of a dedicated validation split separate from the final evaluation set. |
| Hardware Specification | Yes | All our experiments were conducted using 8 NVIDIA Tesla V100 GPUs each with 16GB memory and 16 NVIDIA GTX 1080 Ti GPUs each with 12GB memory. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' and the 'Huggingface Transformers library' but does not specify their version numbers. |
| Experiment Setup | Yes | We train the models with cross-entropy loss using the Adam optimizer [32]. The models are trained for up to 250k steps where at each step we sample a fresh batch of 64 training examples, resulting in 16 million examples over 250k steps. The models are evaluated on 5000 examples for each task. For each model, we tune the various hyperparameters, notably across learning rates {1e-2, 5e-3, ..., 1e-6}, to find the best-performing model. (This setup is mirrored in the training-loop sketch after the table.) |
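
Since the datasets are generated rather than downloaded, the task descriptions alone are enough to approximate the setup. Below is a minimal sketch of a generator for the Index Lookup task, assuming its common formulation (k random tokens followed by a query position, with the token at that position as the label); `VOCAB_SIZE`, `MAX_LEN`, and the example format are illustrative assumptions, not values taken from the paper's supplementary code.

```python
import random

# Hypothetical generator for the Index Lookup task, following the paper's
# description of sampling a length k uniformly and drawing each example
# fresh. VOCAB_SIZE and MAX_LEN are illustrative placeholders, not the
# paper's actual values.
VOCAB_SIZE = 10
MAX_LEN = 100  # the paper evaluates on sequences of small lengths (< 1000)

def make_index_lookup_example(rng: random.Random):
    k = rng.randint(1, MAX_LEN)                    # sample a length k uniformly
    tokens = [rng.randrange(VOCAB_SIZE) for _ in range(k)]
    query = rng.randrange(k)                       # position to look up
    return tokens, query, tokens[query]            # label = token at that position

rng = random.Random(0)
tokens, query, label = make_index_lookup_example(rng)
assert tokens[query] == label
```

A generator for the bounded Dyck task would follow the same on-the-fly pattern, sampling balanced bracket sequences up to a fixed nesting depth.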
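The quoted setup also translates directly into a training loop. The sketch below mirrors it in PyTorch (Adam optimizer, cross-entropy loss, a fresh batch of 64 examples at every step, up to 250k steps); `build_model` and `sample_batch` are hypothetical stand-ins rather than the paper's code, and `LEARNING_RATES` is an assumed expansion of the elided '{1e-2, 5e-3, ..., 1e-6}' grid.

```python
import torch
import torch.nn as nn

# Assumed expansion of the elided learning-rate grid "{1e-2, 5e-3, ..., 1e-6}".
LEARNING_RATES = [1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6]
STEPS = 250_000   # up to 250k steps, per the quoted setup
BATCH_SIZE = 64   # fresh batch of 64 training examples at each step

def train(build_model, sample_batch, lr):
    """build_model and sample_batch are hypothetical callables, not the paper's code."""
    model = build_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(STEPS):
        inputs, targets = sample_batch(BATCH_SIZE)  # fresh examples every step
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)      # cross-entropy loss
        loss.backward()
        optimizer.step()                            # Adam update
    return model
```

The best model across the sweep would then be selected by accuracy on the 5000 held-out evaluation examples quoted above.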