Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation
Authors: Hyeongwon Jang, Changhun Kim, Eunho Yang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines. Section 5 (Experiments) presents a comprehensive evaluation of the empirical effectiveness of TIMING. |
| Researcher Affiliation | Collaboration | 1Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea 2AITRICS, Seoul, South Korea. |
| Pseudocode | Yes | The overall framework of our method is illustrated in Figure 2 and the detailed algorithm is provided in Appendix A. Appendix A (Algorithm): The detailed procedure for the efficient version of TIMING that we used in our experiments is provided in Algorithm 1. |
| Open Source Code | Yes | Our code is available at https://github.com/drumpt/TIMING. |
| Open Datasets | Yes | For synthetic datasets, we utilize Switch-Feature (Tonekaboni et al., 2020; Liu et al., 2024b) and State (Tonekaboni et al., 2020; Crabbé & Van Der Schaar, 2021). For real-world datasets, we employ MIMIC-III (Johnson et al., 2016), Personal Activity Monitoring (PAM) (Reiss & Stricker, 2012), Boiler (Shohet et al., 2019), Epilepsy (Andrzejak et al., 2001), Wafer (Dau et al., 2019), and Freezer (Dau et al., 2019). These datasets span a wide range of real-world time series domains, ensuring a comprehensive evaluation of TIMING's effectiveness. Detailed descriptions of the datasets are provided in Appendix D. |
| Dataset Splits | Yes | The dataset is split into 800 training samples and 200 test samples for evaluating time series XAI methods. We train on the first 800 sequences and reserve the remaining 200 for evaluation. Results are aggregated as mean ± standard error over five random cross-validation repetitions. |
| Hardware Specification | No | No specific hardware details are provided in the paper. While computational efficiency is discussed, no specific GPU, CPU, or memory specifications are mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) are explicitly mentioned in the paper. |
| Experiment Setup | Yes | We primarily evaluate TIMING on black-box classifiers using a single-layer GRU (Chung et al., 2014), following the experimental protocols of prior works (Tonekaboni et al., 2020; Leung et al., 2023; Crabbé & Van Der Schaar, 2021; Enguehard, 2023; Liu et al., 2024b;a). To demonstrate the model-agnostic versatility of TIMING, we assess its performance on CNNs (Krizhevsky et al., 2012) and Transformers (Vaswani et al., 2017) in Appendix E. As illustrated in Table 8, TIMING generalizes across black-box model types. Table 6: Hyperparameter sensitivity analysis for (n, s_min, s_max), reporting CPD (K = 50) on MIMIC-III with average and zero substitutions. Our default setting of (n, s_min, s_max) = (50, 10, 48) achieves optimal results by balancing these factors. |
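TIMING builds on Integrated Gradients (IG). For context, here is a minimal NumPy sketch of *vanilla* IG over a (time, feature) window — not the paper's temporality-aware variant from Algorithm 1. The linear toy scorer `f(x) = sum(w * x)` and all variable names are illustrative assumptions, chosen because IG is exact for linear models, which makes the sketch easy to check.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Riemann (midpoint) approximation of Integrated Gradients:
    (x - baseline) * average gradient along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints in (0, 1)
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy linear scorer over a (time=2, features=2) window: f(x) = sum(w * x).
w = np.array([[0.5, -1.0],
              [2.0, 0.25]])
grad_f = lambda x: w  # gradient of a linear model is constant
x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
baseline = np.zeros_like(x)  # zero substitution, as in the paper's CPD metric

attr = integrated_gradients(grad_f, x, baseline)
# For a linear model IG is exact: attr == w * (x - baseline),
# and attributions sum to f(x) - f(baseline) (the completeness axiom).
```

The completeness property (`attr.sum() == f(x) - f(baseline)`) is a useful sanity check when re-implementing any IG variant against the released code at https://github.com/drumpt/TIMING.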