Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reproducibility Study of "Learning Perturbations to Explain Time Series Predictions"
Authors: Jiapeng Fan, Luke Cadigan, Paulius Skaisgiris, Sebastian Uriel Arias
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we attempt to reproduce the results of Enguehard (2023), which introduced Extremal Mask, a mask-based perturbation method for explaining time series data. We investigated the key claims of this paper, namely that (1) the model outperformed other models in several key metrics on both synthetic and real data, and (2) the model performed better when using the loss function of the preservation game relative to that of the deletion game. Although discrepancies exist, our results generally support the core of the original paper's conclusions. Next, we interpret Extremal Mask's outputs using new visualizations and metrics and discuss the insights each interpretation provides. Finally, we test whether Extremal Mask creates out-of-distribution samples and find that the model does not exhibit this flaw on our tested synthetic dataset. Overall, our results support and add nuance to the original paper's findings. |
| Researcher Affiliation | Academia | The provided text only lists the names of the authors: Jiapeng Fan, Luke Cadigan, Paulius Skaisgiris, Sebastian Arias. It does not include any institutional affiliations, email addresses, or other information that would allow for classification of affiliation types (academia, industry, or collaboration). However, given the context of a 'Reproducibility Study' and 'Reviewed on Open Review', it is highly likely to be academic. |
| Pseudocode | No | The paper describes mathematical optimization problems (Equation 1, 2, 3) and outlines the Forward algorithm (FA) with equations (6, 7, 8, 9) in Section A.2. However, it does not present these as structured pseudocode or algorithm blocks in a code-like format with explicit steps. |
| Open Source Code | Yes | Code available at this link. Enguehard (2023) provided an open-source implementation of their proposed approach as part of the Python tint library (https://github.com/josephenguehard/time_interpret). The repository includes implementations of all methods used in this study, namely Dyna Mask (Crabbé & Van Der Schaar, 2021), Augmented Occlusion (Tonekaboni et al., 2020), Deep Lift (Shrikumar et al., 2017), FIT (Tjoa & Guan, 2020), Gradient Shap (Lundberg & Lee, 2017), Integrated Gradients (Sundararajan et al., 2017), Lime (Ribeiro et al., 2016), Occlusion (Zeiler & Fergus, 2014), and Retain (Choi et al., 2016). In our codebase, we provide Jupyter notebooks that enable reproduction of our additional experiments. |
| Open Datasets | Yes | The original paper utilized two datasets: a synthetic dataset generated by a Hidden Markov Model (HMM) and the MIMIC-III dataset, both of which pose a classification problem. Enguehard (2023) used the implementation of the HMM dataset from Crabbé & Van Der Schaar (2021). The MIMIC-III dataset (Johnson et al., 2016) includes vital-sign information for over 40k intensive care unit patients at Beth Israel Deaconess Medical Center. From this dataset, we trained the classifier on 18,390 train samples and Extremal Mask on 4,598 test samples. Each sample contains a binary mortality outcome (sampled patients had a 9% mortality rate) and 31 vital signs measured over 48 hour-long time steps. |
| Dataset Splits | Yes | For the synthetic data: We generate a dataset containing 1000 such samples with T = 200 and D = 3, i.e. X ∈ ℝ^(1000×200×3) and Y ∈ ℝ^(1000×200). This dataset is split into 800 training samples and 200 test samples. For MIMIC-III: From this dataset, we trained the classifier on 18,390 train samples and Extremal Mask on 4,598 test samples. |
| Hardware Specification | Yes | We used 1 NVIDIA A100 GPU and nine CPUs on a computer cluster for all of our reproducibility experiments. Our experiments took a total of around 81 hours to run, with the time used per claim specified in Section A.3. We ran all of our extensions on CPU (Intel i7-4720HQ), taking negligible time. |
| Software Dependencies | No | The paper mentions using the 'Python tint library' and 'PyTorch' but does not provide specific version numbers for these or any other key software components, which is required for a reproducible description of ancillary software. It mentions a 'relatively old PyTorch version' but no exact version. |
| Experiment Setup | Yes | We use the default hyperparameters provided by the tint library. The classifier and the NN within Extremal Mask both utilize a GRU (Cho et al., 2014) architecture across all our experiments. These choices are consistent with the original paper. |
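For context, the dataset dimensions and the 800/200 train/test split quoted in the Dataset Splits row can be sketched as below. This is a minimal illustration, not the authors' code: the random data, array names, and fixed seed are assumptions standing in for the actual HMM-generated samples.

```python
import numpy as np

# Synthetic dataset dimensions quoted above: 1000 samples,
# T = 200 time steps, D = 3 features per step.
rng = np.random.default_rng(0)  # illustrative seed
X = rng.standard_normal((1000, 200, 3))   # stands in for HMM observations
Y = rng.integers(0, 2, size=(1000, 200))  # stands in for per-step labels

# Shuffle indices, then take 800 training / 200 test samples.
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]

print(X_train.shape, X_test.shape)  # (800, 200, 3) (200, 200, 3)
```

Splitting by shuffled indices keeps each (X, Y) pair aligned, since the same index selects both the observation tensor and its label sequence.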