Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models
Authors: Bosong Huang, Ming Jin, Yuxuan Liang, Johan Barthelemy, Debo Cheng, Qingsong Wen, Chenghao Liu, Shirui Pan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both synthetic and real-world datasets demonstrate that SHAPEX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations. Our code is made available at https://github.com/BosonHwang/ShapeX |
| Researcher Affiliation | Collaboration | Bosong Huang1, Ming Jin1, Yuxuan Liang2, Johan Barthelemy3, Debo Cheng4, Qingsong Wen5, Chenghao Liu6, Shirui Pan1. Affiliations: 1Griffith University; 2Hong Kong University of Science and Technology (Guangzhou); 3NVIDIA; 4Hainan University; 5Squirrel Ai Learning, USA; 6Salesforce Research Asia |
| Pseudocode | No | The paper describes the methodology in Section 3 and its subsections, outlining steps and formulas, but it does not present any distinct block labeled "Pseudocode" or "Algorithm" with structured, code-like steps. |
| Open Source Code | Yes | Our code is made available at https://github.com/BosonHwang/ShapeX |
| Open Datasets | Yes | The synthetic data includes four motif-based binary classification datasets: (i) MCC-E, (ii) MTC-L, (iii) MCC-L, and (iv) MTC-E, following [37]. Each sample is annotated with ground-truth saliency for evaluating explanation quality. For real-world data, we use the ECG dataset [38], which similarly provides ground-truth labels, and the full UCR Archive [39], comprising over 100 univariate classification datasets across diverse domains. |
| Dataset Splits | Yes | For dataset splits, we follow the standard training/test partitions provided in each dataset. A validation set comprising 20% of the training set is held out to tune hyperparameters and perform early stopping. |
| Hardware Specification | Yes | All experiments were conducted on a machine equipped with an NVIDIA RTX 4090 GPU and 24 GB of RAM. |
| Software Dependencies | No | All models and explanation methods are implemented in PyTorch, and our codebase supports efficient parallel evaluation across datasets and seeds. Explanation: While PyTorch is mentioned as the implementation framework, no specific version number for PyTorch or any other software dependency is provided. |
| Experiment Setup | Yes | All experiments were conducted on a machine equipped with an NVIDIA RTX 4090 GPU and 24 GB of RAM. The black-box classifiers used in our evaluation (e.g., Transformer, CNN) are trained independently using standard cross-entropy loss until convergence, with early stopping based on validation accuracy. Unless otherwise specified, default training settings from each baseline's original implementation are followed. To ensure robustness and account for variability in training and explanation outputs, we repeat each experiment across five random seeds, reporting the mean and standard deviation as error bars. The random seeds affect both model initialization and data shuffling. For methods involving sampling-based perturbation (e.g., TIMEX++), the same set of seeds is applied to ensure fair comparison. |
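
The split and seeding protocol quoted in the Dataset Splits and Experiment Setup rows (20% of the training set held out for validation, five random seeds, mean and standard deviation reported) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names, the seed values 0 to 4, the dataset size of 100, the placeholder metric, and the use of the sample standard deviation are all assumptions, since the quoted text does not specify them.

```python
import random
import statistics


def split_train_val(indices, val_fraction=0.2, seed=0):
    """Shuffle indices with a seeded RNG and hold out a validation fraction.

    The 20% default mirrors the quoted setup; the shuffling scheme is assumed.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    # First n_val shuffled items become validation, the rest stay in training.
    return shuffled[n_val:], shuffled[:n_val]


def run_experiment(seed, n_train=100):
    """Hypothetical stand-in for one train/explain/evaluate run.

    A real run would train a classifier on train_idx, early-stop on val_idx,
    and score the explanations; here a seeded placeholder score is returned.
    """
    train_idx, val_idx = split_train_val(range(n_train), seed=seed)
    rng = random.Random(seed)  # the seed drives both shuffling and the "model"
    return 0.80 + 0.05 * rng.random()  # placeholder metric, not a real result


def summarize(seeds=(0, 1, 2, 3, 4)):
    """Repeat the experiment across seeds and return (mean, sample std)."""
    scores = [run_experiment(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    mean, std = summarize()
    print(f"score: {mean:.3f} +/- {std:.3f}")
```

Reusing the same seed tuple for every compared method corresponds to the quoted statement that the same set of seeds is applied for fair comparison.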