Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. In our evaluations on language downstream and long-context tasks, xLSTM 7B shows comparable performance to Transformers and Mamba models of the same size, but with our optimized block architecture it achieves the highest prefill and generation throughput with the lowest GPU memory footprint on our inference efficiency benchmarks. |
| Researcher Affiliation | Collaboration | 1NXAI GmbH, Linz, Austria 2Johannes Kepler University, Linz, Austria 3Now at Google DeepMind. |
| Pseudocode | No | The paper describes the recurrent formulation of the mLSTM cell using mathematical equations (1)-(11) and provides architectural diagrams (Figures 1, 8), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our model weights, model code and training code are open-source. Model: https://huggingface.co/NX-AI/xLSTM-7b Code: https://github.com/NX-AI/xlstm and https://github.com/NX-AI/xlstm-jax. |
| Open Datasets | Yes | trained on 2.3T tokens from the DCLM dataset (Li et al., 2024)... We only use publicly available high-quality datasets for pre-training. ... The dataset proportions for the second stage are listed in the second column of Tab. 5. |
| Dataset Splits | No | The paper mentions training on 2.3T tokens over 550K steps and ablation trainings on 160B tokens for 76,000 steps with specific batch sizes and context lengths. It also refers to 'evaluations on language downstream and long-context tasks' (Table 1, Figure 3) and 'validation perplexity' (Figure 9, 10, 11). However, it does not explicitly detail the train/validation/test splits (e.g., percentages or exact counts) for any of the datasets used, nor does it cite predefined splits for specific tasks in a way that allows reproduction of the data partitioning. |
| Hardware Specification | Yes | Pre-training was conducted on a high-performance computing cluster comprising 128 NVIDIA H100 GPUs. We benchmark generative inference with our x LSTM 7B model on a single NVIDIA H100 GPU with batch size 1, unless specified otherwise. |
| Software Dependencies | Yes | We use model implementations from the Hugging Face transformers library and optimize each with torch.compile and PyTorch CUDA Graphs (Nguyen et al., 2021). For all vLLM speeds, we use PyTorch 2.6.0 to enable Codestral-Mamba-7b, whereas for the Hugging Face speed experiments, we use PyTorch 2.5.1. |
| Experiment Setup | Yes | We pre-train xLSTM 7B for a total of 550K (thousand) training steps with batch size 512 and context length 8192, encompassing a total of 2.3T (trillion) training tokens. We apply batch size ramp-up with batch size 128 for the first 2000 steps, 256 for the next 2000 steps, and the full batch size (512) afterward. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with (peak) α = 5×10⁻⁴, β1 = 0.99, β2 = 0.95, ϵ = 10⁻⁸, weight decay 0.1 and gradient clipping norm 0.5. The learning rate schedule comprises a linear warm-up over 3000 training steps, an exponential decay phase that spans 540,000 steps, and a linear cool-down lasting 7000 steps. |
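The learning-rate schedule quoted in the Experiment Setup row (linear warm-up over 3,000 steps, exponential decay over 540,000 steps, linear cool-down over 7,000 steps, peak learning rate 5×10⁻⁴) can be sketched as a step-to-rate function. This is a minimal illustration, not the authors' code; in particular, the decay floor `MIN_LR` and the cool-down endpoint of zero are assumptions, since the paper excerpt does not state the learning rate reached at the end of the exponential decay.

```python
# Hedged sketch of the reported xLSTM 7B learning-rate schedule:
# linear warm-up -> exponential decay -> linear cool-down.
PEAK_LR = 5e-4          # peak learning rate from the paper
WARMUP_STEPS = 3_000    # linear warm-up length from the paper
DECAY_STEPS = 540_000   # exponential decay length from the paper
COOLDOWN_STEPS = 7_000  # linear cool-down length from the paper
MIN_LR = 5e-5           # ASSUMED decay floor; not stated in the excerpt

def lr_at(step: int) -> float:
    """Learning rate at a given training step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    if step < WARMUP_STEPS + DECAY_STEPS:
        # Exponential decay from PEAK_LR toward the assumed MIN_LR.
        frac = (step - WARMUP_STEPS) / DECAY_STEPS
        return PEAK_LR * (MIN_LR / PEAK_LR) ** frac
    # Linear cool-down from MIN_LR to 0 over the final steps.
    frac = (step - WARMUP_STEPS - DECAY_STEPS) / COOLDOWN_STEPS
    return max(0.0, MIN_LR * (1.0 - frac))
```

Note that 3,000 + 540,000 + 7,000 = 550,000 steps, matching the 550K total training steps reported in the row above.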