Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories

Authors: Maurice Kraus, Felix Divo, Devendra Singh Dhami, Kristian Kersting

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods while requiring very little memory. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. We conduct a series of experiments to evaluate the forecasting capabilities of xLSTM-Mixer, aiming to provide comprehensive insights into its performance. Our primary focus is on long-term forecasting, following the works of Das et al. [2023], Chen et al. [2023c], Lin et al. [2024], and Liu et al. [2025]. Further tasks are explored in Sec. 4.3. Additionally, we perform an extensive model analysis, including visualizations of the initial embedding tokens, hyperparameter sensitivity, and performance measurement. Finally, an ablation study identifies the contributions of the individual components.
Researcher Affiliation Academia Maurice Kraus (1), Felix Divo (1), Devendra Singh Dhami (2), Kristian Kersting (1,3,4,5). (1) AI & ML Group, TU Darmstadt; (2) TU Eindhoven; (3) Hessian Center for AI (hessian.AI); (4) German Research Center for AI (DFKI); (5) Centre for Cognitive Science, TU Darmstadt
Pseudocode No The paper describes the methodology using text and mathematical equations, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 3 "xLSTM-Mixer" details the architecture and mechanisms without a structured pseudocode format.
Open Source Code Yes Code available at https://github.com/mauricekraus/xlstm-mixer
Open Datasets Yes Datasets. We generally follow the established benchmark procedure of Wu et al. [2021] and Zhou et al. [2021] for best backward and future comparability. The datasets we thus used are provided as an overview in App. D. Table 5: The long-term forecasting benchmark datasets and their key properties.
Dataset | Source | Domain | Horizons | Sampling | #Variates | Hurst exp.
Weather | Zhou et al. [2021] | Weather | 96–720 | 10 min | 21 | 0.333–1.000
Electricity | Zhou et al. [2021] | Power Usage | 96–720 | 1 hour | 321 | 0.555–1.000
Traffic | Wu et al. [2021] | Traffic Load | 96–720 | 1 hour | 862 | 0.162–1.000
ETT | Zhou et al. [2021] | Power Production | 96–720 | 15&60 min | 7 | 0.906–1.000
Dataset Splits Yes Notation. In multivariate time series forecasting, the model is presented with a time series $X = (x_1, \dots, x_T) \in \mathbb{R}^{V \times T}$ consisting of T time steps with V variates each. Given this context, the forecaster shall predict the future values $Y = (x_{T+1}, \dots, x_{T+H}) \in \mathbb{R}^{V \times H}$ up to a horizon H. A variate (sometimes called a channel) can be any scalar measurement, such as the occupancy of a road or the temperature in a power plant. The measurements are assumed to be carried out jointly, such that the T + H time steps reflect a regularly sampled signal. A time series dataset consists of N such pairs $\{(X^{(i)}, Y^{(i)})\}_{i \in \{1, \dots, N\}}$ divided into train, validation, and test portions.
Hardware Specification Yes Our codebase is implemented in Python 3.11, leveraging PyTorch version 2. [Paszke et al., 2019] in combination with Lightning version 2.42 for model training and optimization. We used the custom CUDA implementation for sLSTM, which relies on NVIDIA Compute Capability 8.0. Thus, our experiments were conducted on a single NVIDIA A100 80GB GPU.
Software Dependencies Yes Our codebase is implemented in Python 3.11, leveraging PyTorch version 2. [Paszke et al., 2019] in combination with Lightning version 2.42 for model training and optimization. We used the custom CUDA implementation for sLSTM, which relies on NVIDIA Compute Capability 8.0.
Experiment Setup Yes Training and Hyperparameters. We optimized xLSTM-Mixer in 32 bits for up to 60 epochs with a cosine-annealing scheduler with the Adam optimizer [Kingma and Ba, 2015], using β1 = 0.9 and β2 = 0.999 and no weight decay. Hyperparameter (HP) tuning was conducted using Optuna [Akiba et al., 2019] with the choices provided in Tab. 4. We optimized for the L1 forecast error, also known as the Mean Absolute Error (MAE). To further stabilize the training process, gradient clipping with a maximum norm of 1.0 was applied. All experiments were run with the three different random seeds {2021, 2022, 2023}. Table 4: Hyperparameters and their choices.
Hyperparameter | Choices
Batch size | {16, 32, 64, 128, 256, 512}
Initial learning rate | {1·10^-2, 3·10^-3, 1·10^-3, 5·10^-4, 2·10^-4, 1·10^-4}
Scheduler warmup steps | {5, 10, 15}
Lookback length | {96, 256, 512, 768, 1024, 2048}
Embedding dimension D | {32, 64, 128, 256, 512, 768, 1024}
sLSTM conv. kernel width | {disabled, 2, 4}
sLSTM dropout rate | {0.1, 0.25}
# sLSTM blocks M | {1, 2, 3, 4}
# sLSTM heads | {4, 8, 16, 32}
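
The training setup quoted above can be illustrated with a small, dependency-free Python sketch. It is not the authors' code: the dictionary keys, the function names, the linear warmup shape, the annealing floor of zero, and the uniform random sampling (standing in for Optuna's search) are all assumptions made for illustration. Only the hyperparameter choices, the seeds, the MAE objective, and the clipping norm of 1.0 come from the quoted text.

```python
import math
import random

# Choices transcribed from Table 4; the key names are hypothetical.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128, 256, 512],
    "initial_lr": [1e-2, 3e-3, 1e-3, 5e-4, 2e-4, 1e-4],
    "warmup_steps": [5, 10, 15],
    "lookback_length": [96, 256, 512, 768, 1024, 2048],
    "embedding_dim": [32, 64, 128, 256, 512, 768, 1024],
    "slstm_conv_kernel_width": [None, 2, 4],  # None stands for "disabled"
    "slstm_dropout": [0.1, 0.25],
    "num_slstm_blocks": [1, 2, 3, 4],
    "num_slstm_heads": [4, 8, 16, 32],
}
SEEDS = (2021, 2022, 2023)  # the three seeds named in the setup


def sample_config(seed):
    """Draw one configuration uniformly at random.

    Assumption: a plain random draw replaces Optuna's sampler here.
    """
    rng = random.Random(seed)
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def learning_rate(step, total_steps, initial_lr, warmup_steps):
    """Linear warmup (assumed shape), then cosine annealing to zero (assumed floor)."""
    if step < warmup_steps:
        return initial_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def mean_absolute_error(pred, target):
    """L1 forecast error averaged over all predicted values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Calling `sample_config(seed)` for each entry of `SEEDS` yields one candidate configuration per seed, and `learning_rate` can be evaluated step by step to inspect the schedule. Whether the warmup counts optimizer steps or epochs is not stated in the quoted text, so the unit of `step` is left open here.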