Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Conformal Online Learning of Deep Koopman Linear Embeddings

Authors: Ben Gao, Jordan Patracone, Stéphane Chrétien, Olivier Alata

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We now conduct experiments on multiple datasets derived either from solving differential equations associated with canonical dynamical systems, or from real-world sequential measurements. For the sake of reproducibility, all datasets and methods can be found at https://github.com/ben2022lo/COLoKe, and we report full implementation details as well as complementary numerical studies in the supplementary material.
Researcher Affiliation	Academia	Ben Gao, Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Inria, Laboratoire Hubert Curien UMR 5516, F-42023, SAINT-ETIENNE, France, EMAIL; Jordan Patracone, Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Inria, Laboratoire Hubert Curien UMR 5516, F-42023, SAINT-ETIENNE, France, EMAIL; Stéphane Chrétien, Université Lyon 2, Laboratoire ERIC, Bron, France, EMAIL; Olivier Alata, Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, SAINT-ETIENNE, France, EMAIL
Pseudocode	Yes	Algorithm 1: Conformal Online Learning of Koopman embeddings (COLoKe). Require: buffer size $w$, initial parameters $(\theta_{w-1}, K_{w-1})$, step size $\eta > 0$. 1: Initialize conformity threshold $q_w$. 2: for $t = w, w+1, \ldots$ do 3: Observe new state $x_t$ and update buffer $D_t = \{x_{t-w}, \ldots, x_t\}$. 4: Set $(\theta_t, K_t) \leftarrow (\theta_{t-1}, K_{t-1})$. 5: Compute prediction conformity score $s_t \leftarrow \ell_{t-w,w}(\theta_t, K_t)$ (see Eq. (7)). 6: Update threshold: $q_{t+1} \leftarrow \mathrm{ConformalPI}(q_t)$ (see Eq. (3) with $e_t = \mathbb{1}\{s_t > q_t\}$). 7: while $s_t > q_t$ do 8: Perform a gradient-based step: $(\theta_t, K_t) \leftarrow (\theta_t, K_t) - \eta \nabla_{\theta,K} L_t(\theta_t, K_t)$ (see Eq. (5)). 9: Recompute $s_t \leftarrow \ell_{t-w,w}(\theta_t, K_t)$. 10: end while 11: end for
Open Source Code	Yes	For the sake of reproducibility, all datasets and methods can be found at https://github.com/ben2022lo/COLoKe
Open Datasets	Yes	We complement them with three real-world datasets: the Electricity Transformer Dataset (ETD) [Zhou et al., 2021], the EEG Motor Movement/Imagery Dataset [Schalk et al., 2004, Goldberger et al., 2000] and a real turbulence dataset CASES-99 [Earth Observing Laboratory, 1999]. ETD: We test on ETTh1, which is part of the Electricity Transformer Dataset (ETDataset, https://github.com/zhouhaoyi/ETDataset) introduced by Zhou et al. [2021]. ETTh1 contains data recorded at hourly intervals for roughly two years from an electricity transformer station. The dataset consists of a single trajectory of 6 features: HUFL (High Use Frequency Load), HULL (High Use Low Load), MUFL (Medium Use Frequency Load), MULL (Medium Use Low Load), LUFL (Low Use Frequency Load), and LULL (Low Use Low Load). In our setting, we retain 200 time steps and aim to learn the transformer's load profile as a dynamical system. EEG: To assess COLoKe's performance on high-dimensional data, we evaluate it on electroencephalogram (EEG) recordings with 64 channels from the PhysioNet EEG Motor Movement/Imagery dataset (https://physionet.org/content/eegmmidb/1.0.0/). We use the recording of Subject 1 in the eyes open condition. The original signal consists of one minute of data sampled at 160 Hz. We downsample it to 16 Hz, resulting in a trajectory of dimension d = 64 and of length T = 976. CASES-99: To further evaluate COLoKe on real-world atmospheric turbulence, we use data from the CASES-99 (Cooperative Atmospheric Surface Exchange Study 1999, https://www.eol.ucar.edu/field_projects/cases-99) field experiment. CASES-99 examined boundary-layer turbulence using high-frequency measurements from instrumented towers and remote sensing systems. In our setting, we use the recordings from the 55m tower. Specifically, we extract the three-dimensional wind components u, v, and w and retain 500 time steps. This yields a trajectory of dimension d = 3 and length T = 500.
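Dataset Splits	Yes	For each dynamic, we simulate 2000 trajectories and construct 5 random train-test splits $\{(I^{\mathrm{train}}_k, I^{\mathrm{test}}_k)\}_{k=1}^{5}$. For each split, models are trained using the training trajectories while computing the online prediction error $\varepsilon_k = \frac{1}{|I^{\mathrm{train}}_k|} \sum_{i \in I^{\mathrm{train}}_k} \sum_{t=t_0+1}^{T} \| x^{(i)}_t - \mathrm{Model}_{t-1}(x^{(i)}_{t-1}) \|^2$ (Eq. (12)). After the training, models are evaluated on the held-out trajectories to compute the generalization error $\xi_k = \frac{1}{|I^{\mathrm{test}}_k|} \sum_{i \in I^{\mathrm{test}}_k} \sum_{t=2}^{T} \| x^{(i)}_t - \mathrm{Model}_{T}(x^{(i)}_{t-1}) \|^2$ (Eq. (13)).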
Hardware Specification	Yes	Figure 2b reports the test error as a function of training time (in seconds), measured on an NVIDIA RTX 2000 ADA GPU.
Software Dependencies	No	The paper mentions libraries like `SciPy` (Appendix B.1.1 for `odeint`) but does not provide version numbers for any software dependencies.
Experiment Setup	Yes	All models are trained using the AdamW optimizer with a learning rate of $10^{-3}$, both during parameter initialization and online training. The initialization phase consists of 4000 epochs for synthetic datasets and 5000 epochs for the real-world dataset. The neural network $\Phi$ is fully connected with architecture $\{d, 32, 16, 8, m-d\}$ for synthetic datasets and $\{d, 64, 32, 16, m-d\}$ for the real dataset. The dimension of lifted representation $m$ is chosen to be d + d/2 for all experiments. Model parameters and initial conformity threshold are initialized with $\{x_0, \ldots, x_{t_0}\}$ as already discussed. The initial threshold $q_{t_0+1}$ is set to be the $(1-\alpha)$-quantile of the set of scores $\{s_{w+1}, \ldots, s_{t_0}\}$ computed with the initialized model. The model parameters $(\theta_t, K_t)$ are updated online according to Algorithm 1. For all synthetic datasets, the hyperparameters for the Conformal PI procedure are $\alpha = 0.5$, $\gamma = 0.1$, $C_{\mathrm{sat}} = 5$, and we set $C_{\mathrm{sat}} = 10$ for the real dataset. Note that the coverage guarantees [Angelopoulos et al., 2023] hold for any value of $\gamma > 0$. We set $\gamma = 0.1$ in our experiments as in [Angelopoulos et al., 2023], although one may implement adaptive strategies as in [Bhatnagar et al., 2023]. The choice of $\alpha$ should reflect the trade-off between computational efficiency and predictive accuracy, depending on the requirements of the application. When $\alpha \approx 1$, accuracy is prioritized over speed, whereas for $\alpha \ll 1$, computational speed is more important. In our experiments, we selected $\alpha = 0.5$ as a balanced compromise between these two objectives. OLoKe. The only difference between COLoKe and OLoKe is the online training strategy. For every new buffer $D_t$, OLoKe performs a fixed number of iterations. Online AE. We implement the model in [Liang et al., 2022]. The original work tackles the problem of Model Predictive Control (MPC). We aim only to perform online learning of dynamics. For synthetic datasets, the encoder architecture is $\{d, 32, 16, 8, m\}$ and the decoder architecture is $\{m, 8, 16, 32, d\}$. For the real dataset, the encoder architecture is $\{d, 64, 32, 16, m\}$ and the decoder architecture is $\{m, 64, 32, 16, d\}$. The dimension of lifted representation $m$ is chosen to be d + d/2. Thus, the model architecture of Online AE aligns with COLoKe and OLoKe. The loss function at time $t$ is defined as $L_t(D_t, \Phi_t, \Psi_t, K_t) = \sum_{k=t-w}^{t} \underbrace{\| \Psi_t[K_t \Phi_t(x_{k-1})] - x_k \|^2}_{\text{prediction loss}} + \underbrace{\| \Psi_t(\Phi_t(x_k)) - x_k \|^2}_{\text{autoencoding loss}} + \underbrace{\| K_t \Phi_t(x_{k-1}) - \Phi_t(x_k) \|^2}_{\text{lifted prediction loss}}$, where $\Phi_t$ is the encoder and $\Psi_t$ is the decoder. The online training strategy consists of fixed iterations with $N_{\mathrm{iter}} = 100$ for synthetic datasets and $N_{\mathrm{iter}} = 500$ for the real dataset, which aligns with OLoKe. In the original work [Sinha et al., 2019], the authors used Radial Basis Functions (RBF) as the fixed dictionary. However, to estimate informative centers for RBF, one needs to have access to the full trajectory up to time $T$. When estimating the centers with trajectory up to time $t_0$, the model gives poor results on all datasets for both metrics. Therefore, we choose a polynomial dictionary of degree 2, which provides a lifted representation dimension comparable to other baseline models. Increasing the degree beyond 2 offers no significant performance gain. On the contrary, it degrades the performance on the real dataset and results in substantially higher computational costs.
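The control flow of Algorithm 1 quoted in the Pseudocode row can be illustrated with a toy sketch. This is not the authors' implementation: the lifted model is replaced by a scalar linear map x_{t+1} ≈ k·x_t, the conformity score is assumed to be the mean squared one-step error over the buffer (the paper's Eq. (7) is not reproduced on this page), and the threshold update is a plain quantile-tracking step q ← q + γ(e − α) standing in for the Conformal PI rule of Eq. (3). The function names, the `max_inner` safeguard, and the default `q0` are hypothetical choices made for this illustration only.

```python
def score(buf, k):
    """Assumed conformity score: mean squared one-step error of x_{i+1} ~ k * x_i."""
    errs = [(buf[i + 1] - k * buf[i]) ** 2 for i in range(len(buf) - 1)]
    return sum(errs) / len(errs)


def grad(buf, k):
    """Derivative of score(buf, k) with respect to the scalar k."""
    terms = [-2.0 * buf[i] * (buf[i + 1] - k * buf[i]) for i in range(len(buf) - 1)]
    return sum(terms) / len(terms)


def conformal_online_fit(xs, w=5, k0=0.0, q0=0.01, eta=0.05,
                         alpha=0.5, gamma=0.1, max_inner=200):
    """Toy version of the loop: update k only while the score exceeds the threshold."""
    k, q = k0, q0
    n_flagged = 0
    for t in range(w, len(xs)):
        buf = xs[t - w:t + 1]                 # buffer D_t = {x_{t-w}, ..., x_t}
        s = score(buf, k)                     # conformity score with current parameters
        e = 1.0 if s > q else 0.0             # e_t = 1{s_t > q_t}
        n_flagged += int(e)
        q_next = q + gamma * (e - alpha)      # simplified stand-in for the Conformal PI update
        steps = 0
        # max_inner is an added safeguard (not in the quoted algorithm) so the loop always ends
        while s > q and steps < max_inner:
            k -= eta * grad(buf, k)           # gradient-based step on the buffer loss
            s = score(buf, k)
            steps += 1
        q = q_next
    return k, q, n_flagged


# Noise-free geometric trajectory whose true coefficient is 0.9
xs = [0.9 ** t for t in range(60)]
k_hat, q_final, n_flagged = conformal_online_fit(xs)
print(k_hat, n_flagged)
```

The point of the sketch is the gating: gradient steps are taken only at time steps where the score exceeds the current threshold, so the threshold dynamics decide how much computation is spent.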