Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation
Authors: Hyeongwon Jang, Changhun Kim, Eunho Yang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines. Section 5 (Experiments) presents a comprehensive evaluation of the empirical effectiveness of TIMING. |
| Researcher Affiliation | Collaboration | 1Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea 2AITRICS, Seoul, South Korea. |
| Pseudocode | Yes | The overall framework of our method is illustrated in Figure 2 and the detailed algorithm is provided in Appendix A. Appendix A (Algorithm): The detailed procedure for the efficient version of TIMING that we used in our experiments is provided in Algorithm 1. |
| Open Source Code | Yes | Our code is available at https://github.com/drumpt/TIMING. |
| Open Datasets | Yes | For synthetic datasets, we utilize Switch-Feature (Tonekaboni et al., 2020; Liu et al., 2024b) and State (Tonekaboni et al., 2020; Crabbé & Van Der Schaar, 2021). For real-world datasets, we employ MIMIC-III (Johnson et al., 2016), Personal Activity Monitoring (PAM) (Reiss & Stricker, 2012), Boiler (Shohet et al., 2019), Epilepsy (Andrzejak et al., 2001), Wafer (Dau et al., 2019), and Freezer (Dau et al., 2019). These datasets span a wide range of real-world time series domains, ensuring a comprehensive evaluation of TIMING's effectiveness. Detailed descriptions of the datasets are provided in Appendix D. |
| Dataset Splits | Yes | The dataset is split into 800 training samples and 200 test samples for evaluating time series XAI methods. We train on the first 800 sequences and reserve the remaining 200 for evaluation. Results are aggregated as mean ± standard error over five random cross-validation repetitions. |
| Hardware Specification | No | No specific hardware details are provided in the paper. While computational efficiency is discussed, no specific GPU, CPU, or memory specifications are mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) are explicitly mentioned in the paper. |
| Experiment Setup | Yes | We primarily evaluate TIMING on black-box classifiers using a single-layer GRU (Chung et al., 2014), following the experimental protocols of prior works (Tonekaboni et al., 2020; Leung et al., 2023; Crabbé & Van Der Schaar, 2021; Enguehard, 2023; Liu et al., 2024b;a). To demonstrate the model-agnostic versatility of TIMING, we assess its performance on CNNs (Krizhevsky et al., 2012) and Transformers (Vaswani et al., 2017) in Appendix E. As illustrated in Table 8, TIMING generalizes across black-box model types. Table 6: Hyperparameter sensitivity analysis for (n, s_min, s_max), reporting CPD (K = 50) on MIMIC-III with average and zero substitutions. Our default setting of (n, s_min, s_max) = (50, 10, 48) achieves optimal results by balancing these factors. |
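TIMING builds on Integrated Gradients (IG). For context, here is a minimal NumPy sketch of *vanilla* IG over a (time, feature) window — not the paper's temporality-aware variant from Algorithm 1. The linear toy scorer `f(x) = sum(w * x)` and all variable names are illustrative assumptions, chosen because IG is exact for linear models, which makes the sketch easy to check.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Riemann (midpoint) approximation of Integrated Gradients:
    (x - baseline) * average gradient along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints in (0, 1)
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy linear scorer over a (time=2, features=2) window: f(x) = sum(w * x).
w = np.array([[0.5, -1.0],
              [2.0, 0.25]])
grad_f = lambda x: w  # gradient of a linear model is constant
x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
baseline = np.zeros_like(x)  # zero substitution, as in the paper's CPD metric

attr = integrated_gradients(grad_f, x, baseline)
# For a linear model IG is exact: attr == w * (x - baseline),
# and attributions sum to f(x) - f(baseline) (the completeness axiom).
```

The completeness property (`attr.sum() == f(x) - f(baseline)`) is a useful sanity check when re-implementing any IG variant against the released code at https://github.com/drumpt/TIMING.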