Separations in the Representational Capabilities of Transformers and Recurrent Architectures

Authors: Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also empirically investigated the performance of Transformers and standard recurrent models [26], including recently proposed state-space models [22, 21], on these tasks. The observed behavior is along the lines indicated by our theoretical results. In this section, we investigate the performance of Transformers and recurrent models on tasks such as Index Lookup and recognizing bounded Dyck languages on sequences of small lengths (< 1000).
Researcher Affiliation | Collaboration | Satwik Bhattamishra (University of Oxford), Michael Hahn (Saarland University), Phil Blunsom (University of Oxford; Cohere), Varun Kanade (University of Oxford)
Pseudocode | No | The paper describes methods and constructions in text but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Yes we have included the code in the supplementary material."
Open Datasets | No | The paper describes custom data generation methods (e.g., 'To create each example, we first sample a length k uniformly...'), implying the datasets are generated rather than sourced from existing public repositories with specific access links or citations. (A hedged sketch of one such generation procedure is given after this table.)
Dataset Splits | No | The paper reports 'Validation Accuracy (%)' in figures and states that 'The models are evaluated on 5000 examples for each task', but it does not specify the size or percentage of a dedicated validation split separate from the final evaluation set.
Hardware Specification | Yes | All our experiments were conducted using 8 NVIDIA Tesla V100 GPUs each with 16GB memory and 16 NVIDIA GTX 1080 Ti GPUs each with 12GB memory.
Software Dependencies | No | The paper mentions software like PyTorch and the Hugging Face Transformers library but does not specify their version numbers.
Experiment Setup | Yes | We train the models with cross-entropy loss using the Adam optimizer [32]. The models are trained for up to 250k steps where at each step we sample a fresh batch of 64 training examples, resulting in 16 million examples over 250k steps. The models are evaluated on 5000 examples for each task. For each model, we tune the various hyperparameters, notably across learning rates {1e-2, 5e-3, ..., 1e-6} to find the best-performing model. (A hedged sketch of this training protocol follows the table.)
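
For concreteness, here is a minimal sketch of how one Index Lookup example might be generated, following the quoted description of sampling a length k uniformly per example. The vocabulary size, maximum length, and input layout (query index appended at the end) are illustrative assumptions, not details taken from the paper.

```python
import random

def make_index_lookup_example(vocab_size=20, max_len=100):
    """One Index Lookup example: the model must return the token at the
    queried position. vocab_size, max_len, and the exact input layout are
    assumptions for illustration; the paper only states that a length k
    is sampled uniformly for each example."""
    k = random.randint(1, max_len)                        # sample a length k uniformly
    tokens = [random.randrange(vocab_size) for _ in range(k)]
    query = random.randrange(k)                           # position to look up
    return tokens + [query], tokens[query]                # (input sequence, target)
```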
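
Likewise, a minimal PyTorch sketch of the reported training protocol: Adam with cross-entropy loss and a fresh batch of 64 examples at each of up to 250k steps (about 16 million examples). The `sample_batch` callable, the model interface, and the intermediate values of the learning-rate grid are assumptions for illustration; the authors' actual code is in the paper's supplementary material.

```python
import torch
import torch.nn as nn

def train(model, sample_batch, steps=250_000, batch_size=64, lr=1e-4, device="cpu"):
    """Hedged sketch: fresh batch sampled at every step, cross-entropy loss,
    Adam optimizer. `sample_batch(batch_size)` is assumed to return
    (inputs, targets) tensors suitable for `model`."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for step in range(steps):
        inputs, targets = sample_batch(batch_size)        # fresh batch each step
        logits = model(inputs.to(device))                 # (batch, num_classes)
        loss = criterion(logits, targets.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

# A hyperparameter sweep would call train(...) once per candidate learning rate
# (the paper lists the grid as {1e-2, 5e-3, ..., 1e-6} without spelling out the
# intermediate values) and keep the model with the best validation accuracy.
```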