Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Contimask: Explaining Irregular Time Series via Perturbations in Continuous Time

Authors: Max Moebus, Björn Braun, Christian Holz

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We consider 5 problem settings. We first convert two commonly used synthetic scenarios for regular time series explanations into the continuous time setting... We then adapt these two scenarios... We finish by explaining a model trained on a common problem for irregular time series models: sepsis prediction from hospital records... Metrics For the Rare Time & Rare Feature settings, ground truth saliency maps are available. We calculate the F1 score (F1), Precision (Prec), and Recall (Rec) for correctly identifying these maps...
Researcher Affiliation	Academia	Max Moebus, Björn Braun, and Christian Holz Department of Computer Science, ETH Zurich Zurich, Switzerland {max.moebus};{bjoern.braun};{christian.holz}@inf.ethz.ch
Pseudocode	No	The paper describes mathematical formulations for perturbations and objective functions but does not present them in a clearly labeled 'pseudocode' or 'algorithm' block.
Open Source Code	Yes	Source code is available on GitHub.
Open Datasets	Yes	We train a NCDE and mtan model on the sepsis prediction task as implemented by Kidger et al. [8]. We publicly share our code and all data is either synthetically created as part of the code we provide or publicly available online and we provide the download and processing scripts.
Dataset Splits	Yes	We only explain cases on the test set (5.4% mortality). Both models which achieves a binary AUC of roughly 0.90 on a held-out test set (the same 20% split as per [8]).
Hardware Specification	Yes	We run all experiments using an H200 GPU needing at most 8GB of VRAM. All experiments were performed on a H200 GPU, where the used VRAM never exceeded 8 GB.
Software Dependencies	No	The paper mentions using 'PGPE algorithm [25, 7] as implemented in EvoTorch [35] using the ClipUp optimizer [34]' but does not provide specific version numbers for these software components.
Experiment Setup	Yes	We set λ1 = 0.01, λ2 = 0.001 and train for 16,000 epochs using an Adam optimizer with a learning rate of 0.01, or 2000 iterations using the PGPE optimizer with a population size of 100. For PGPE, we initialize with a radius of 3, and a center learning rate of 0.5 ( 0.3).
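
The Research Type row quotes the paper as scoring explanations by F1, precision, and recall against ground-truth saliency maps in the Rare Time and Rare Feature settings. The following is a minimal sketch of how such scores can be computed, not the authors' implementation. It assumes both maps are already binary masks of the same shape (any thresholding of a continuous mask is left out), and the function name `saliency_scores` is hypothetical.

```python
from typing import Sequence, Tuple


def saliency_scores(
    pred: Sequence[Sequence[int]],
    truth: Sequence[Sequence[int]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for two binary saliency masks.

    Assumption: `pred` and `truth` have the same shape and contain only
    0/1 entries, where 1 marks a salient (time step, feature) cell.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # cell correctly marked salient
            elif p and not t:
                fp += 1  # cell marked salient but not in the ground truth
            elif t and not p:
                fn += 1  # salient cell that was missed

    # Guard against empty denominators (no predicted or no true salient cells).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 2 time steps x 3 features, purely illustrative values.
truth_mask = [[1, 1, 0], [0, 0, 0]]
pred_mask = [[1, 0, 1], [0, 0, 0]]
precision, recall, f1 = saliency_scores(pred_mask, truth_mask)
print(precision, recall, f1)
```

In the toy example, one of the two predicted cells is correct and one of the two true cells is recovered, so all three scores come out to 0.5.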