Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

True Zero-Shot Inference of Dynamical Systems Preserving Long-Term Statistics

Authors: Christoph Jürgen Hemmer, Daniel Durstewitz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our model on multiple benchmark DS and real-world time series, demonstrating superior zero-shot generalization on DSR problems compared to current TS foundation models. Just from a provided context signal, without any re-training, DynaMix faithfully forecasts the long-term evolution of novel DS where existing time series (TS) foundation models, like Chronos, fail, at a fraction of the number of parameters (0.1%) and orders of magnitude faster inference times. DynaMix outperforms TS foundation models in terms of long-term statistics, and often also short-term forecasts, even on real-world time series, like traffic or weather data, typically used for training and evaluating TS models, but not at all part of DynaMix' training corpus. We illustrate some of the failure modes of TS models for DSR problems, and conclude that models built on DS principles may bear a huge potential also for advancing the TS prediction field.
Researcher Affiliation	Academia	Christoph Jürgen Hemmer^{1,3}, Daniel Durstewitz^{1,2,3}. ^1 Dept. of Theoretical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany; ^2 Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Germany; ^3 Faculty of Physics and Astronomy, Heidelberg University, Heidelberg, Germany. EMAIL
Pseudocode	No	The paper describes the model architecture and training methods in Section 3.1 'Model architecture' and 3.2 'Model training' using natural language and mathematical equations, but does not include any explicit figure, block, or section labeled 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	Code available at https://github.com/DurstewitzLab/DynaMix-julia (Julia), https://github.com/DurstewitzLab/DynaMix-python (Python).
Open Datasets	Yes	Training data: DynaMix is trained on about 0.6 million simulated time series of length T = 550 sampled from 34 different 3d DS with cyclic or chaotic attractors, collected in [30].... The traffic data are hourly recordings of the number of cars passing road junctions (https://www.kaggle.com/datasets/fedesoriano/traffic-prediction-dataset/data). The cloud data, also used in [86] to evaluate TS foundation models, are publicly available from Huawei Cloud (https://github.com/sir-lab/data-release). ...The weather data... can be accessed via https://www.dwd.de/EN/ourservices/cdc/cdc_ueberblick-klimadaten_en.html. The functional magnetic resonance imaging (fMRI) data... is publicly available on GitHub [51]. The ETTh1 dataset... can be accessed at https://github.com/zhouhaoyi/ETDataset. Electroencephalogram (EEG) data were taken from a study by Schalk et al. [77]...
Dataset Splits	Yes	From each, multivariate time series X ∈ R^{N×T} are simulated (≈ 6 × 10^5 in total) and then standardized dimension-wise (likewise for the test set), of which the first T_C < T column entries are defined as the context signal C = X_{1:T_C}. The context length used in training was set to T_C = 500, and the window of overlap with the model-generated time series to t = 50, see Sect. 3.2. Test data: Our test set for DS consists of simulated time series of length 10^5 sampled from 54 different 3d DS collected in [30], which are not part of the training set.
Hardware Specification	Yes	Training was performed on a single CPU (18-Core Xeon Gold 6254)... For comparability, all models were evaluated on the exact same CPU (18-Core Xeon Gold 6254) and GPU (Nvidia RTX 2080 Ti) using 512GB of RAM.
Software Dependencies	No	The paper mentions 'Julia' and 'Python' as programming languages for the code and 'Rectified adaptive moment estimation (RADAM) [54]' as the optimizer. However, it does not specify version numbers for these languages or any software libraries used in the implementation.
Experiment Setup	Yes	For training our model, we used a variant of sparse teacher forcing (STF)... with τ = 10 here... To this we add a regularization term ... where we chose λ = 0.1 and c = 0.01. Rectified adaptive moment estimation (RADAM) [54] was employed as the optimizer, with L = 50 batches of S_B = 16 sequences per epoch, each of length T = 550, and 2000 epochs in total. We used a learning rate exponentially decaying from η_start = 5 × 10^{-3} to η_end = 10^{-5}. The context length used in training was set to T_C = 500, and the window of overlap with the model-generated time series to t = 50... We use J = 10 AL-RNN experts for our model. Each expert has a latent dimension of M = 30, of which P = 2 are rectified-linear units (ReLUs). ...The gating network is implemented using a single-layer CNN with three channels and a kernel size of 2, stride of 1, and zero padding, with the identity as activation function. The MLP... consists of two layers with ReLU activation. We initialize the temperature weights τ_att and τ_exp to 0.1, the covariance matrix Σ = 0.05 · 1, matrix D = I_{N×M}...
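
Illustration (not from the paper): the numeric settings quoted in the Dataset Splits and Experiment Setup rows can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, the learning rate is assumed to be updated once per epoch (the quoted text only says it decays exponentially between the two values), and standardization is assumed to use the population standard deviation of each dimension.

```python
import math

# Values quoted in the "Dataset Splits" and "Experiment Setup" rows.
ETA_START = 5e-3   # initial learning rate
ETA_END = 1e-5     # final learning rate
N_EPOCHS = 2000    # total number of training epochs
T_TOTAL = 550      # length of each training sequence
T_CONTEXT = 500    # context length T_C used in training


def learning_rate(epoch: int) -> float:
    """Exponential decay from ETA_START (epoch 0) to ETA_END (last epoch).

    Assumption: one update per epoch; the source does not state the step unit.
    """
    frac = epoch / (N_EPOCHS - 1)
    return ETA_START * (ETA_END / ETA_START) ** frac


def standardize(series):
    """Standardize each dimension (row) of an N x T series to zero mean, unit variance.

    Assumption: no dimension is constant, so the standard deviation is non-zero.
    """
    out = []
    for row in series:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        out.append([(x - mean) / std for x in row])
    return out


def split_context(series, t_context=T_CONTEXT):
    """Return (context, remainder): the first t_context columns and the rest."""
    context = [row[:t_context] for row in series]
    remainder = [row[t_context:] for row in series]
    return context, remainder


if __name__ == "__main__":
    # Toy 3-dimensional series of length T_TOTAL (placeholder data, not a real DS).
    toy = [[math.sin(0.1 * t + d) for t in range(T_TOTAL)] for d in range(3)]
    context, remainder = split_context(standardize(toy))
    print(len(context[0]), len(remainder[0]))
    print(learning_rate(0), learning_rate(N_EPOCHS - 1))
```

With T = 550 and T_C = 500, the split leaves 50 time steps after the context, which matches the overlap window of 50 quoted in the table.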