Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Reproducibility of: "Learning Perturbations to Explain Time Series Predictions"

Authors: Wouter Bant, Ádám Divák, Jasper Eppink, Floris Six Dijkstra

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This study tries to reproduce and extend the work of Enguehard (2023b), focusing on time series explainability by incorporating learnable masks and perturbations. Enguehard (2023b) employed two methods to learn these masks and perturbations, the preservation game (yielding SOTA results) and the deletion game (with poor performance). We extend the work by revising the deletion game's loss function, testing the robustness of the proposed method on a novel weather dataset, and visualizing the learned masks and perturbations. Despite notable discrepancies in results across many experiments, our findings demonstrate that the proposed method consistently outperforms all baselines and exhibits robust performance across datasets.
Researcher Affiliation Academia Wouter Bant (EMAIL), Faculty of Science, University of Amsterdam; Ádám Divák (EMAIL), Faculty of Science, University of Amsterdam; Jasper Eppink (EMAIL), Faculty of Science, University of Amsterdam; Floris Six Dijkstra (EMAIL), Faculty of Science, University of Amsterdam
Pseudocode No The paper provides formal equations for objective functions (e.g., Equations 1, 2, 3, 4) and a schematic flowchart (Figure 5), but no explicit 'Pseudocode' or 'Algorithm' block with structured code-like steps.
Open Source Code Yes Our implementation is available at https://github.com/adamdivak/time_interpret The source code of the research undertaken by the author is publicly accessible (Enguehard, 2023a). Provided Source Code: https://github.com/josephenguehard/time_interpret, the same code but with some additions we made: https://github.com/Anonymous8523/Repro
Open Datasets Yes The original paper conducted experiments on two datasets. The first one is a synthetic dataset generated by a Hidden Markov Model (HMM), which is closely related to the HMM dataset used in both Crabbé & Van Der Schaar (2021) and Tonekaboni et al. (2020). For this dataset, we know the true saliency which makes the evaluation of the explainer methods easier. The second dataset in the original paper is the MIMIC-III dataset, containing the vital signs and lab measurements of patients in intensive care units. Both datasets have a binary variable as the target variable. Our research expands on these datasets by also including weather data with a binary target variable. The MIMIC-III clinical database (version 1.4) was used for conducting this research. ... Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. Mimic-iii, a freely accessible critical care database. Scientific Data, 3(1):160035, May 2016. The meteorological data is sourced from Dutch weather stations and is publicly available through the Royal Dutch Meteorological Institute, which falls under the Ministry of Infrastructure and Water Management.
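The synthetic dataset described above is generated by a Hidden Markov Model so that the true saliency is known by construction. A minimal two-state HMM sketch is shown below for illustration only; the transition probability, emission means, and noise scale are assumptions and differ from the generative parameters actually used in Tonekaboni et al. (2020).

```python
import random

def sample_hmm(n_steps, p_switch=0.1, seed=0):
    """Illustrative 2-state HMM: a hidden state flips with probability
    p_switch, and each observation is drawn around a state-dependent mean.
    All parameter values here are assumptions, not the paper's."""
    rng = random.Random(seed)
    state = rng.randint(0, 1)
    states, obs = [], []
    for _ in range(n_steps):
        if rng.random() < p_switch:
            state = 1 - state
        # Noisy emission whose mean depends on the hidden state.
        obs.append((1.0 if state == 1 else -1.0) + rng.gauss(0.0, 0.5))
        states.append(state)
    return states, obs

states, obs = sample_hmm(20)
```

Because the hidden state sequence is returned alongside the observations, the ground-truth saliency is available for evaluating explainers, which is the property the paper relies on.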
Dataset Splits Yes In each dataset, 20% is allocated for testing, while the remaining 80% is utilized for cross-validation with 5 folds. The reported metrics include the mean and standard deviations of the performance across these 5 folds on the test set.
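The split described above (20% held-out test set, 5-fold cross-validation on the remaining 80%) can be sketched as follows; this is a minimal stdlib illustration, not the authors' code, and the shuffling seed and fold-assignment scheme are assumptions.

```python
import random

def split_dataset(indices, test_frac=0.2, n_folds=5, seed=0):
    """Hold out test_frac of the samples for testing and partition the
    remainder into n_folds cross-validation folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    test_idx = idx[:n_test]
    cv_idx = idx[n_test:]
    # Round-robin assignment keeps fold sizes balanced.
    folds = [cv_idx[f::n_folds] for f in range(n_folds)]
    return test_idx, folds

test_idx, folds = split_dataset(range(1000))
```

Metrics would then be computed on the fixed test set once per fold, and the mean and standard deviation reported across the 5 folds, as the paper states.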
Hardware Specification Yes For the Hidden Markov Model (HMM) with 20 time steps, we employed a CPU, specifically the 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz. The computational process took approximately 45 hours to replicate the results outlined in our paper. Notably, for all other presented results, a single NVIDIA A100 GPU was utilized, requiring approximately 30 hours for completion.
Software Dependencies No The paper mentions models like Bidirectional Gated Recurrent Unit (Bi-GRU) and refers to mathematical equations, but does not list specific software libraries or their version numbers (e.g., TensorFlow, PyTorch, Scikit-learn, with versions).
Experiment Setup Yes In our experiments, the Extr Mask method tries to explain the predictions of a classifier that is trained on the unaltered data. This classifier uses a bidirectional Gated Recurrent Unit (Bi-GRU) (Cho et al., 2014) with a hidden state size of 200. The masks and perturbation network are jointly learned with the training objectives presented in Equations 2 and 3 for the preservation and deletion game, respectively. The perturbation network also uses a Bi-GRU model. In the original paper, none of the hyperparameters are explicitly specified. However, the provided source code has default parameters, which we assume to be the same as those employed in the study.
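A classifier matching the stated description (Bi-GRU, hidden state size 200) can be sketched in PyTorch as below. This is not the authors' implementation; the number of layers, the per-time-step prediction head, and the single-logit binary output are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Bi-GRU classifier sketch: hidden size 200 as reported; one layer and
    a per-time-step linear head are assumptions, not confirmed details."""
    def __init__(self, n_features, hidden_size=200):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size,
                          batch_first=True, bidirectional=True)
        # Both directions are concatenated, hence 2 * hidden_size inputs.
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):
        # x: (batch, time, features) -> logits: (batch, time, 1)
        out, _ = self.gru(x)
        return self.head(out)

model = BiGRUClassifier(n_features=3)
logits = model(torch.randn(4, 20, 3))
```

The perturbation network described in the paper also uses a Bi-GRU, so a similar module could serve as its backbone, with the mask and objective terms from Equations 2 and 3 added on top.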