Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Toward Interpretable Evaluation Measures for Time Series Segmentation
Authors: Félix Chavelli, Paul Boniol, Michaël Thomazo
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures. |
| Researcher Affiliation | Academia | Félix Chavelli (Inria, ENS, CNRS, PSL, Paris, France; EMAIL); Paul Boniol (Inria, ENS, CNRS, PSL, Paris, France; paul.boniol@inria.fr); Michaël Thomazo (Inria, ENS, CNRS, PSL, Paris, France; EMAIL) |
| Pseudocode | Yes | Algorithm 1: Optimal State Mapping; Algorithm 2: State Matching Score (SMS) |
| Open Source Code | Yes | Finally, we provide an open-source implementation of our measures and evaluation. |
| Open Datasets | Yes | The datasets used in this study are publicly available and can be accessed through the following links: PAMAP2... USC-HAD... UCR-SEG... ActRecTut... MoCap... |
| Dataset Splits | No | The paper discusses evaluating segmentation measures on existing datasets and methods. While these methods would have their own data splits for training, the paper itself, in its evaluation of the proposed measures, does not specify explicit train/test/validation splits of the time series for its experimental setup. It mentions, "Each time series is treated as an individual test instance." |
| Hardware Specification | Yes | The experiments were conducted on a standard hardware setup including an Intel Core i7 processor and 32GB of RAM. |
| Software Dependencies | No | The ARI and NMI are calculated using the sklearn library in Python, which provides efficient implementations of these measures. The F1-score and covering scores are computed using a custom implementation, adapted from TSSB code. No specific version numbers for Python or sklearn are provided. |
| Experiment Setup | Yes | We evaluated each algorithm on each dataset, using the same hyperparameters as in [20]. Table 3: Parameters for State Detection and Evaluation Measures. WARI weight: α = 0.1. SMS weights: w_delay = 0.1, w_transition = 0.3, w_isolation = 0.8, w_missing = 0.5. |
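
The Software Dependencies row reports that ARI and NMI were computed with sklearn. As a dependency-free illustration of what the standard (unweighted) ARI measures on per-time-step state labels, the following is a minimal sketch. The function name `adjusted_rand_index` and the toy label sequences are assumptions made for illustration; this is not the paper's WARI, and it is not the authors' implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard Adjusted Rand Index between two labelings of the same time steps.

    Hypothetical helper for illustration only; not the paper's WARI.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts: time steps sharing each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of time-step pairs grouped together in both / each labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total = comb(n, 2)

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_true * sum_pred / total if total else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one state, or all singletons.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Toy example (assumed data): the first predicted boundary is one step late.
ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
prediction = [0, 0, 0, 0, 1, 1, 2, 2, 2]
score = adjusted_rand_index(ground_truth, prediction)
```

Because ARI compares co-membership of time steps rather than label values, a prediction that uses different state identifiers but the same partition still scores 1.0. The standard ARI also treats every time step equally, so it does not distinguish a slightly delayed boundary from other kinds of disagreement.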