Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. In our evaluations on language downstream and long-context tasks, xLSTM 7B shows comparable performance to Transformers and Mamba models of the same size, but with our optimized block architecture it achieves the highest prefill and generation throughput with the lowest GPU memory footprint on our inference efficiency benchmarks. |
| Researcher Affiliation | Collaboration | 1NXAI GmbH, Linz, Austria 2Johannes Kepler University, Linz, Austria 3Now at Google DeepMind. |
| Pseudocode | No | The paper describes the recurrent formulation of the mLSTM cell using mathematical equations (1)-(11) and provides architectural diagrams (Figures 1, 8), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our model weights, model code and training code are open-source. Model: https://huggingface.co/NX-AI/xLSTM-7b Code: https://github.com/NX-AI/xlstm and https://github.com/NX-AI/xlstm-jax. |
| Open Datasets | Yes | trained on 2.3T tokens from the DCLM dataset (Li et al., 2024)... We only use publicly available high-quality datasets for pre-training. ... The dataset proportions for the second stage are listed in the second column of Tab. 5. |
| Dataset Splits | No | The paper mentions training on 2.3T tokens over 550K steps and ablation trainings on 160B tokens for 76,000 steps with specific batch sizes and context lengths. It also refers to 'evaluations on language downstream and long-context tasks' (Table 1, Figure 3) and 'validation perplexity' (Figure 9, 10, 11). However, it does not explicitly detail the train/validation/test splits (e.g., percentages or exact counts) for any of the datasets used, nor does it cite predefined splits for specific tasks in a way that allows reproduction of the data partitioning. |
| Hardware Specification | Yes | Pre-training was conducted on a high-performance computing cluster comprising 128 NVIDIA H100 GPUs. We benchmark generative inference with our x LSTM 7B model on a single NVIDIA H100 GPU with batch size 1, unless specified otherwise. |
| Software Dependencies | Yes | We use model implementations from the Hugging Face transformers library and optimize each with torch.compile and PyTorch CUDA Graphs (Nguyen et al., 2021). For all vLLM speeds, we use PyTorch 2.6.0 to enable Codestral-Mamba-7b, whereas for the Hugging Face speed experiments, we use PyTorch 2.5.1. |
| Experiment Setup | Yes | We pre-train xLSTM 7B for a total of 550K (thousand) training steps with batch size 512 and context length 8192, encompassing a total of 2.3T (trillion) training tokens. We apply batch size ramp-up with batch size 128 for the first 2000 steps, 256 for the next 2000 steps, and the full batch size (512) afterward. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with (peak) α = 5×10⁻⁴, β1 = 0.99, β2 = 0.95, ϵ = 10⁻⁸, weight decay 0.1 and gradient clipping norm 0.5. The learning rate schedule comprises a linear warm-up over 3000 training steps, an exponential decay phase that spans 540,000 steps, and a linear cool-down lasting 7000 steps. |
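The learning-rate schedule quoted in the Experiment Setup row (linear warm-up over 3,000 steps, exponential decay over 540,000 steps, linear cool-down over 7,000 steps, peak learning rate 5×10⁻⁴) can be sketched as a step-to-rate function. This is a minimal illustration, not the authors' code; in particular, the decay floor `MIN_LR` and the cool-down endpoint of zero are assumptions, since the paper excerpt does not state the learning rate reached at the end of the exponential decay.

```python
# Hedged sketch of the reported xLSTM 7B learning-rate schedule:
# linear warm-up -> exponential decay -> linear cool-down.
PEAK_LR = 5e-4          # peak learning rate from the paper
WARMUP_STEPS = 3_000    # linear warm-up length from the paper
DECAY_STEPS = 540_000   # exponential decay length from the paper
COOLDOWN_STEPS = 7_000  # linear cool-down length from the paper
MIN_LR = 5e-5           # ASSUMED decay floor; not stated in the excerpt

def lr_at(step: int) -> float:
    """Learning rate at a given training step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    if step < WARMUP_STEPS + DECAY_STEPS:
        # Exponential decay from PEAK_LR toward the assumed MIN_LR.
        frac = (step - WARMUP_STEPS) / DECAY_STEPS
        return PEAK_LR * (MIN_LR / PEAK_LR) ** frac
    # Linear cool-down from MIN_LR to 0 over the final steps.
    frac = (step - WARMUP_STEPS - DECAY_STEPS) / COOLDOWN_STEPS
    return max(0.0, MIN_LR * (1.0 - frac))
```

Note that 3,000 + 540,000 + 7,000 = 550,000 steps, matching the 550K total training steps reported in the row above.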