Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reproducibility Study of "Learning Perturbations to Explain Time Series Predictions"
Authors: Jiapeng Fan, Luke Cadigan, Paulius Skaisgiris, Sebastian Uriel Arias
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we attempt to reproduce the results of Enguehard (2023), which introduced Extremal Mask, a mask-based perturbation method for explaining time series data. We investigated the key claims of this paper, namely that (1) the model outperformed other models in several key metrics on both synthetic and real data, and (2) the model performed better when using the loss function of the preservation game relative to that of the deletion game. Although discrepancies exist, our results generally support the core of the original paper's conclusions. Next, we interpret Extremal Mask's outputs using new visualizations and metrics and discuss the insights each interpretation provides. Finally, we test whether Extremal Mask creates out-of-distribution samples and find that the model does not exhibit this flaw on our tested synthetic dataset. Overall, our results support and add nuance to the original paper's findings. |
| Researcher Affiliation | Academia | The provided text only lists the names of the authors: Jiapeng Fan, Luke Cadigan, Paulius Skaisgiris, Sebastian Arias. It does not include any institutional affiliations, email addresses, or other information that would allow for classification of affiliation types (academia, industry, or collaboration). However, given the context of a 'Reproducibility Study' and 'Reviewed on Open Review', it is highly likely to be academic. |
| Pseudocode | No | The paper describes mathematical optimization problems (Equation 1, 2, 3) and outlines the Forward algorithm (FA) with equations (6, 7, 8, 9) in Section A.2. However, it does not present these as structured pseudocode or algorithm blocks in a code-like format with explicit steps. |
| Open Source Code | Yes | Code available at this link. Enguehard (2023) provided an open-source implementation of their proposed approach as part of the Python tint library (https://github.com/josephenguehard/time_interpret). The repository includes implementations of all methods used in this study, namely Dyna Mask (Crabbé & Van Der Schaar, 2021), Augmented Occlusion (Tonekaboni et al., 2020), Deep Lift (Shrikumar et al., 2017), FIT (Tjoa & Guan, 2020), Gradient Shap (Lundberg & Lee, 2017), Integrated Gradients (Sundararajan et al., 2017), Lime (Ribeiro et al., 2016), Occlusion (Zeiler & Fergus, 2014), and Retain (Choi et al., 2016). In our codebase, we provide Jupyter notebooks that enable reproduction of our additional experiments. |
| Open Datasets | Yes | The original paper utilized two datasets: a synthetic dataset generated by a Hidden Markov Model (HMM) and the MIMIC-III dataset, both of which pose a classification problem. Enguehard (2023) used the implementation of the HMM dataset from Crabbé & Van Der Schaar (2021). The MIMIC-III dataset (Johnson et al., 2016) includes vital-sign information for over 40k intensive care unit patients at Beth Israel Deaconess Medical Center. From this dataset, we trained the classifier on 18,390 train samples and Extremal Mask on 4,598 test samples. Each sample contains a binary mortality outcome (sampled patients had a 9% mortality rate) and 31 vital signs measured over 48 hour-long time steps. |
| Dataset Splits | Yes | For the synthetic data: We generate a dataset containing 1000 such samples with T = 200 and D = 3, i.e. X ∈ ℝ^(1000×200×3) and Y ∈ ℝ^(1000×200). This dataset is split into 800 training samples and 200 test samples. For MIMIC-III: From this dataset, we trained the classifier on 18,390 train samples and Extremal Mask on 4,598 test samples. |
| Hardware Specification | Yes | We used 1 NVIDIA A100 GPU and nine CPUs on a computer cluster for all of our reproducibility experiments. Our experiments took a total of around 81 hours to run, with the time used per claim specified in Section A.3. We ran all of our extensions on CPU (Intel i7-4720HQ), taking negligible time. |
| Software Dependencies | No | The paper mentions using the 'Python tint library' and 'PyTorch' but does not provide specific version numbers for these or any other key software components, which is required for a reproducible description of ancillary software. It mentions a 'relatively old PyTorch version' but no exact version. |
| Experiment Setup | Yes | We use the default hyperparameters provided by the tint library. The classifier and the NN within Extremal Mask both utilize a GRU (Cho et al., 2014) architecture across all our experiments. These choices are consistent with the original paper. |
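For context, the dataset dimensions and the 800/200 train/test split quoted in the Dataset Splits row can be sketched as below. This is a minimal illustration, not the authors' code: the random data, array names, and fixed seed are assumptions standing in for the actual HMM-generated samples.

```python
import numpy as np

# Synthetic dataset dimensions quoted above: 1000 samples,
# T = 200 time steps, D = 3 features per step.
rng = np.random.default_rng(0)  # illustrative seed
X = rng.standard_normal((1000, 200, 3))   # stands in for HMM observations
Y = rng.integers(0, 2, size=(1000, 200))  # stands in for per-step labels

# Shuffle indices, then take 800 training / 200 test samples.
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]

print(X_train.shape, X_test.shape)  # (800, 200, 3) (200, 200, 3)
```

Splitting by shuffled indices keeps each (X, Y) pair aligned, since the same index selects both the observation tensor and its label sequence.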