xLSTM: Extended Long Short-Term Memory

Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally evaluate xLSTM and compare it to existing methods with a focus on language modeling. We investigate xLSTM's specific capabilities on synthetic tasks in Section 4.1. In Section 4.2, we compare the validation set perplexity of various current language modeling methods that have been trained on 15B tokens from SlimPajama (Soboleva et al., 2023). On the same dataset, we perform ablation studies for xLSTM.
Researcher Affiliation | Collaboration | (1) ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; (2) NXAI Lab, Linz, Austria; (3) NXAI GmbH, Linz, Austria
Pseudocode | No | The paper does not contain explicitly labeled "Pseudocode" or "Algorithm" blocks. It describes mathematical formulations of the models, but not in pseudocode format.
Open Source Code | Yes | Code available at: https://github.com/NX-AI/xlstm
Open Datasets | Yes | We train models on 15B tokens from SlimPajama (Soboleva et al., 2023), and evaluate their perplexity on the validation set. (...) We use 16 out of the 18 data sources of the PALOMA dataset (Magnusson et al., 2023).
Dataset Splits | Yes | We compare the validation set perplexity of various current language modeling methods (...) We use the validation perplexity as a stopping criterion and evaluate on the test set. (...) 100,000 training samples (validation: 3,000 samples)
Hardware Specification | Yes | We developed and trained all our models and baselines over the course of three months on a cluster with 128 nodes of eight NVIDIA A100 GPUs each.
Software Dependencies | Yes | For all experiments, we use Python 3.11 with PyTorch 2.2.0 and CUDA 12.1.
Experiment Setup | Yes | We tokenize our datasets using the Hugging Face GPT-2 tokenizer (...) we choose context length 2048 and batch sizes 256 or 512 for our models. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with beta parameters (β1, β2) = (0.9, 0.95) and an epsilon parameter of 1e-5, and gradient clipping at gradient norm 1. As learning rate scheduler, we use a linear warm-up with 750 steps and cosine decay to 10% of the peak learning rate. We apply a weight decay of 0.1 to all our models.
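To make the hyperparameters in the Experiment Setup row concrete, here is a minimal PyTorch sketch of the optimizer and learning-rate schedule described there. The peak learning rate, total step count, and model are not given in the excerpt above, so `peak_lr`, `total_steps`, and the placeholder `model` are assumptions for illustration only, not the authors' training code.

```python
import math
import torch

# Placeholder model; the actual xLSTM architecture lives in the authors' repo
# (https://github.com/NX-AI/xlstm).
model = torch.nn.Linear(512, 512)

peak_lr = 1e-3          # assumption: peak LR is not stated in the excerpt above
warmup_steps = 750      # linear warm-up over 750 steps, as reported
total_steps = 30_000    # assumption: depends on tokens, batch size, context length
min_lr_ratio = 0.10     # cosine decay down to 10% of the peak learning rate

# AdamW with the reported betas, epsilon, and weight decay.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warm-up followed by cosine decay to 10% of the peak LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, clip gradients at norm 1 before the optimizer step:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```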
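The Open Datasets and Experiment Setup rows also suggest a short data-loading sketch. This is a hedged illustration rather than the authors' pipeline: the Hugging Face dataset id `cerebras/SlimPajama-627B` and the use of `AutoTokenizer` are assumptions; the excerpt only states that a 15B-token SlimPajama subset was used with the GPT-2 tokenizer at context length 2048.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# GPT-2 tokenizer, as stated in the Experiment Setup row.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Assumption: the public SlimPajama release on the Hugging Face Hub; the paper
# trains on a 15B-token subset whose exact selection is not given in the excerpt.
dataset = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

context_length = 2048  # context length reported in the Experiment Setup row

# Peek at a couple of streamed documents and tokenize them to the context length.
for example in dataset.take(2):
    tokens = tokenizer(example["text"], truncation=True, max_length=context_length)
    print(len(tokens["input_ids"]))
```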