xLSTM: Extended Long Short-Term Memory

Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally evaluate xLSTM and compare it to existing methods with a focus on language modeling. We investigate xLSTM's specific capabilities on synthetic tasks in Section 4.1. In Section 4.2, we compare the validation set perplexity of various current language modeling methods that have been trained on 15B tokens from SlimPajama (Soboleva et al., 2023). On the same dataset, we perform ablation studies for xLSTM.
Researcher Affiliation | Collaboration | (1) ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; (2) NXAI Lab, Linz, Austria; (3) NXAI GmbH, Linz, Austria
Pseudocode | No | The paper does not contain explicitly labeled "Pseudocode" or "Algorithm" blocks. It describes mathematical formulations of the models, but not in pseudocode format.
Open Source Code | Yes | Code available at: https://github.com/NX-AI/xlstm
Open Datasets | Yes | We train models on 15B tokens from SlimPajama (Soboleva et al., 2023), and evaluate their perplexity on the validation set. (...) We use 16 out of the 18 data sources of the PALOMA dataset (Magnusson et al., 2023).
Dataset Splits | Yes | We compare the validation set perplexity of various current language modeling methods (...) We use the validation perplexity as a stopping criterion and evaluate on the test set. (...) 100,000 training samples (validation: 3,000 samples)
Hardware Specification | Yes | We developed and trained all our models and baselines over the course of three months on a cluster with 128 nodes of eight NVIDIA A100 GPUs each.
Software Dependencies | Yes | For all experiments, we use Python 3.11 with PyTorch 2.2.0 and CUDA 12.1.
Experiment Setup | Yes | We tokenize our datasets using the Hugging Face GPT-2 tokenizer (...) we choose context length 2048 and batch sizes 256 or 512 for our models. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with beta parameters (β1, β2) = (0.9, 0.95) and an epsilon parameter of 1e-5, and gradient clipping at gradient norm 1. As learning rate scheduler, we use a linear warm-up with 750 steps and cosine decay to 10% of the peak learning rate. We apply a weight decay of 0.1 to all our models.
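To make the hyperparameters in the Experiment Setup row concrete, here is a minimal PyTorch sketch of the optimizer and learning-rate schedule described there. The peak learning rate, total step count, and model are not given in the excerpt above, so `peak_lr`, `total_steps`, and the placeholder `model` are assumptions for illustration only, not the authors' training code.

```python
import math
import torch

# Placeholder model; the actual xLSTM architecture lives in the authors' repo
# (https://github.com/NX-AI/xlstm).
model = torch.nn.Linear(512, 512)

peak_lr = 1e-3          # assumption: peak LR is not stated in the excerpt above
warmup_steps = 750      # linear warm-up over 750 steps, as reported
total_steps = 30_000    # assumption: depends on tokens, batch size, context length
min_lr_ratio = 0.10     # cosine decay down to 10% of the peak learning rate

# AdamW with the reported betas, epsilon, and weight decay.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warm-up followed by cosine decay to 10% of the peak LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, clip gradients at norm 1 before the optimizer step:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```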
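The Open Datasets and Experiment Setup rows also suggest a short data-loading sketch. This is a hedged illustration rather than the authors' pipeline: the Hugging Face dataset id `cerebras/SlimPajama-627B` and the use of `AutoTokenizer` are assumptions; the excerpt only states that a 15B-token SlimPajama subset was used with the GPT-2 tokenizer at context length 2048.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# GPT-2 tokenizer, as stated in the Experiment Setup row.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Assumption: the public SlimPajama release on the Hugging Face Hub; the paper
# trains on a 15B-token subset whose exact selection is not given in the excerpt.
dataset = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

context_length = 2048  # context length reported in the Experiment Setup row

# Peek at a couple of streamed documents and tokenize them to the context length.
for example in dataset.take(2):
    tokens = tokenizer(example["text"], truncation=True, max_length=context_length)
    print(len(tokens["input_ids"]))
```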