Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Contimask: Explaining Irregular Time Series via Perturbations in Continuous Time

Authors: Max Moebus, Björn Braun, Christian Holz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We consider 5 problem settings. We first convert two commonly used synthetic scenarios for regular time series explanations into the continuous time setting... We then adapt these two scenarios... We finish by explaining a model trained on a common problem for irregular time series models: sepsis prediction from hospital records... Metrics For the Rare Time & Rare Feature settings, ground truth saliency maps are available. We calculate the F1 score (F1), Precision (Prec), and Recall (Rec) for correctly identifying these maps...
Researcher Affiliation | Academia | Max Moebus, Björn Braun, and Christian Holz Department of Computer Science, ETH Zurich Zurich, Switzerland {max.moebus};{bjoern.braun};{christian.holz}@inf.ethz.ch
Pseudocode | No | The paper describes mathematical formulations for perturbations and objective functions but does not present them in a clearly labeled 'pseudocode' or 'algorithm' block.
Open Source Code | Yes | Source code is available on GitHub.
Open Datasets | Yes | We train a NCDE and mtan model on the sepsis prediction task as implemented by Kidger et al. [8]. We publicly share our code and all data is either synthetically created as part of the code we provide or publicly available online and we provide the download and processing scripts.
Dataset Splits | Yes | We only explain cases on the test set (5.4% mortality). Both models which achieves a binary AUC of roughly 0.90 on a held-out test set (the same 20% split as per [8]).
Hardware Specification | Yes | We run all experiments using an H200 GPU needing at most 8GB of VRAM. All experiments were performed on a H200 GPU, where the used VRAM never exceeded 8 GB.
Software Dependencies | No | The paper mentions using 'PGPE algorithm [25, 7] as implemented in EvoTorch [35] using the ClipUp optimizer [34]' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We set λ1 = 0.01, λ2 = 0.001 and train for 16,000 epochs using an Adam optimizer with a learning rate of 0.01, or 2000 iterations using the PGPE optimizer with a population size of 100. For PGPE, we initialize with a radius of 3, and a center learning rate of 0.5 ( 0.3).
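
The Research Type entry above mentions scoring explanations against ground truth saliency maps with F1, Precision, and Recall. The following is a minimal sketch of how such scores could be computed, not the authors' implementation. The function name `saliency_scores`, the nested-list mask layout (time steps by features), and the 0.5 binarization threshold are all assumptions made for illustration; the paper's own code may binarize and aggregate differently.

```python
from typing import Sequence, Tuple


def saliency_scores(
    pred: Sequence[Sequence[float]],
    truth: Sequence[Sequence[int]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) of a predicted saliency map.

    pred  : saliency values in [0, 1], one row per time step (hypothetical layout)
    truth : binary ground truth map of the same shape (1 = salient)
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            selected = p >= threshold  # binarize the predicted saliency
            salient = bool(t)
            if selected and salient:
                tp += 1
            elif selected:
                fp += 1
            elif salient:
                fn += 1

    # Guard against empty denominators (no selected or no salient entries).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 2x2 example: two time steps, two features.
    predicted = [[0.9, 0.1], [0.8, 0.7]]
    ground_truth = [[1, 0], [0, 1]]
    print(saliency_scores(predicted, ground_truth))
```

Counting entries over the whole map, as done here, weights every time step and feature equally; averaging per sample instead would be an equally plausible choice that the entry above does not settle.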