Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Toward Interpretable Evaluation Measures for Time Series Segmentation

Authors: Félix Chavelli, Paul Boniol, Michaël Thomazo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures.
Researcher Affiliation | Academia | Félix Chavelli (Inria, ENS, CNRS, PSL, Paris, France; EMAIL); Paul Boniol (Inria, ENS, CNRS, PSL, Paris, France; paul.boniol@inria.fr); Michaël Thomazo (Inria, ENS, CNRS, PSL, Paris, France; EMAIL)
Pseudocode | Yes | Algorithm 1: Optimal State Mapping; Algorithm 2: State Matching Score (SMS)
Open Source Code | Yes | Finally, we provide an open-source implementation of our measures and evaluation.
Open Datasets | Yes | The datasets used in this study are publicly available and can be accessed through the following links: PAMAP2... USC-HAD... UCR-SEG... ActRecTut... MoCap...
Dataset Splits | No | The paper evaluates segmentation measures on existing datasets and methods. While those methods would have their own data splits for training, the paper's evaluation of the proposed measures does not specify explicit train/validation/test splits of the time series. It states: "Each time series is treated as an individual test instance."
Hardware Specification | Yes | The experiments were conducted on a standard hardware setup including an Intel Core i7 processor and 32GB of RAM.
Software Dependencies | No | The ARI and NMI are calculated using the sklearn library in Python, which provides efficient implementations of these measures. The F1-score and covering scores are computed using a custom implementation, adapted from TSSB code. No specific version numbers for Python or sklearn are provided.
Experiment Setup | Yes | We evaluated each algorithm on each dataset, using the same hyperparameters as in [20]. Table 3 (Parameters for State Detection and Evaluation Measures): WARI weight α = 0.1; SMS weights w_delay = 0.1, w_transition = 0.3, w_isolation = 0.8, w_missing = 0.5.
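
The Software Dependencies and Dataset Splits entries say that ARI is computed with sklearn and that each time series is treated as an individual test instance. The sketch below illustrates that per-series protocol with a small pure-Python ARI, so it runs without third-party packages. It is not the authors' code: the function names `adjusted_rand_index` and `evaluate_per_series` and the toy label sequences are hypothetical. In practice the sklearn call would most plausibly be `sklearn.metrics.adjusted_rand_score`, though the summary does not name the exact function.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two per-timestamp state labelings.

    Hypothetical helper written for illustration; it follows the standard
    pair-counting definition of the ARI.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare, treated as perfect agreement

    # Contingency counts: joint cells and the two marginals.
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Number of timestamp pairs grouped together in each partition.
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2          # upper bound on agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single state
    return (sum_joint - expected) / (maximum - expected)


def evaluate_per_series(series):
    """Score every time series on its own, with no train/test split.

    `series` maps a series name to a (true_labels, pred_labels) pair.
    """
    return {name: adjusted_rand_index(t, p) for name, (t, p) in series.items()}


# Toy data (made up): one relabeled but otherwise perfect segmentation,
# and one whose state boundary is detected one timestamp too early.
toy_series = {
    "relabeled": ([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]),
    "shifted": ([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]),
}
scores = evaluate_per_series(toy_series)
mean_score = sum(scores.values()) / len(scores)
```

Because the ARI is invariant to how states are named, the relabeled series scores as a perfect match, while the one-timestamp boundary shift lowers the score noticeably on such a short series.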