Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models

Authors: Bosong Huang, Ming Jin, Yuxuan Liang, Johan Barthelemy, Debo Cheng, Qingsong Wen, Chenghao Liu, Shirui Pan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on both synthetic and real-world datasets demonstrate that SHAPEX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations. Our code is made available at https://github.com/BosonHwang/ShapeX
Researcher Affiliation	Collaboration	Bosong Huang1, Ming Jin1, Yuxuan Liang2, Johan Barthelemy3, Debo Cheng4, Qingsong Wen5, Chenghao Liu6, Shirui Pan1. 1Griffith University; 2Hong Kong University of Science and Technology (Guangzhou); 3NVIDIA; 4Hainan University; 5Squirrel Ai Learning, USA; 6Salesforce Research Asia
Pseudocode	No	The paper describes the methodology in Section 3 and its subsections, outlining steps and formulas, but it does not present any distinct block labeled "Pseudocode" or "Algorithm" with structured, code-like steps.
Open Source Code	Yes	Our code is made available at https://github.com/BosonHwang/ShapeX
Open Datasets	Yes	The synthetic data includes four motif-based binary classification datasets: (i) MCC-E, (ii) MTC-L, (iii) MCC-L, and (iv) MTC-E, following [37]. Each sample is annotated with ground-truth saliency for evaluating explanation quality. For real-world data, we use the ECG dataset [38], which similarly provides ground-truth labels, and the full UCR Archive [39], comprising over 100 univariate classification datasets across diverse domains.
Dataset Splits	Yes	For dataset splits, we follow the standard training/test partitions provided in each dataset. A validation set comprising 20% of the training set is held out to tune hyperparameters and perform early stopping.
Hardware Specification	Yes	All experiments were conducted on a machine equipped with an NVIDIA RTX 4090 GPU and 24 GB of RAM.
Software Dependencies	No	All models and explanation methods are implemented in PyTorch, and our codebase supports efficient parallel evaluation across datasets and seeds. Explanation: While PyTorch is mentioned as the implementation framework, no specific version number for PyTorch or any other software dependency is provided.
Experiment Setup	Yes	All experiments were conducted on a machine equipped with an NVIDIA RTX 4090 GPU and 24 GB of RAM. The black-box classifiers used in our evaluation (e.g., Transformer, CNN) are trained independently using standard cross-entropy loss until convergence, with early stopping based on validation accuracy. Unless otherwise specified, default training settings from each baseline's original implementation are followed. To ensure robustness and account for variability in training and explanation outputs, we repeat each experiment across five random seeds, reporting the mean and standard deviation as error bars. The random seeds affect both model initialization and data shuffling. For methods involving sampling-based perturbation (e.g., TIMEX++), the same set of seeds is applied to ensure fair comparison.
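
The protocol quoted under Dataset Splits and Experiment Setup (a 20% validation hold-out from the training set, and five seeds summarized as mean and standard deviation) can be illustrated with a minimal sketch. This is not taken from the ShapeX codebase: the function names, the seed values 0 to 4, and the choice of the sample standard deviation are all assumptions made for illustration, since the quoted text does not specify them.

```python
import random
import statistics


def split_train_validation(train_items, val_fraction=0.2, seed=0):
    """Hold out a fraction of the training set as a validation set.

    Hypothetical helper: the quoted text only says 20% of the training
    set is held out, not how the samples are chosen.
    """
    indices = list(range(len(train_items)))
    random.Random(seed).shuffle(indices)  # seeded shuffle for repeatability
    n_val = int(round(len(indices) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [x for i, x in enumerate(train_items) if i not in val_idx]
    val = [x for i, x in enumerate(train_items) if i in val_idx]
    return train, val


def summarize_over_seeds(run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Run one experiment per seed and return (mean, standard deviation).

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the quoted text does not say which estimator the authors report.
    """
    scores = [run_experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

Here `run_experiment` stands for any callable that takes a seed and returns a single metric value; passing the same `seeds` tuple to every method mirrors the statement that the same set of seeds is applied for a fair comparison.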