Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Consistent Algorithms for Clustering Time Series
Authors: Azadeh Khaleghi, Daniil Ryabko, Jérémie Mary, Philippe Preux
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The theoretical findings of this work are illustrated with experiments on both synthetic and real data. Keywords: clustering, time series, ergodicity, unsupervised learning 6. Experimental Results In this section we present empirical evaluations of Algorithms 1 and 2 on both synthetically generated and real data. 6.1 Synthetic Data We start with synthetic experiments. In order for the experiments to reflect the generality of our approach we have selected time-series distributions that, while being stationary ergodic, cannot be considered as part of any of the usual smaller classes of processes, and are difficult to approximate by finite-state models. Namely, we consider rotation processes used, for example, by Shields (1996) as an example of stationary ergodic processes that are not B-processes. Such time series cannot be modelled by a hidden Markov model with a finite or countably infinite set of states. 6.2 Real Data As an application we consider the problem of clustering motion capture sequences, where groups of sequences with similar dynamics are to be identified. Data is taken from the Motion Capture database (MOCAP) which consists of time-series data representing human locomotion. We compare our results to those obtained with two other methods, namely those of Li and Prakash (2011) and Jebara et al. (2007). Note that we have not implemented these reference methods, rather we have taken the numerical results directly from the corresponding articles. In order to have common grounds for each comparison we use the same sets of sequences and the same means of evaluation as those used by Li and Prakash (2011); Jebara et al. (2007). In the paper by Li and Prakash (2011) two MOCAP data sets are used, where the sequences in each data set are labelled with either running or walking as annotated in the database. 
Performance is evaluated via the conditional entropy S of the true labelling with respect to the prediction, i.e., $S := -\sum_{i,j} \frac{M_{ij}}{\sum_{i',j'} M_{i'j'}} \log \frac{M_{ij}}{\sum_{j'} M_{ij'}}$, where M denotes the clustering confusion matrix. The motion sequences used by Li and Prakash (2011) are reportedly trimmed to equal duration. However, we use the original sequences as our method is not limited by variation in sequence lengths. Table 1 lists performance of Algorithm 1 as well as that reported for the method of Li and Prakash (2011); Algorithm 1 performs consistently better. In the paper (Jebara et al., 2007) four MOCAP data sets are used, corresponding to four motions: run, walk, jump and forward jump. Table 2 lists performance in terms of accuracy. The data sets in Table 2 constitute two types of motions: 1. motions that can be considered ergodic: walk, run, run/jog (displayed above the double line), and 2. non-ergodic motions: single jumps (displayed below the double line). As shown in Table 2, Algorithm 1 achieves consistently better performance on the first group of data sets, while being competitive (better on one and worse on the other) on the non-ergodic motions. The time taken to complete each task is in the order of few minutes on a standard laptop computer. |
| Researcher Affiliation | Academia | Azadeh Khaleghi EMAIL Department of Mathematics & Statistics Lancaster University, Lancaster, LA1 4YF, UK Daniil Ryabko EMAIL INRIA Lille 40, avenue de Halley 59650 Villeneuve d'Ascq, France Jérémie Mary EMAIL Philippe Preux EMAIL Université de Lille/CRIStAL (UMR CNRS) 40, avenue de Halley 59650 Villeneuve d'Ascq, France |
| Pseudocode | Yes | Algorithm 1 (Offline clustering) 1: INPUT: sequences S := {x_1, ..., x_N}, number κ of clusters 2: Initialize κ farthest points as cluster centres: 3: c_1 ← 1 4: Set C_1 ← {1} as the first cluster 5: for k = 2..κ do 6: c_k ← argmax_{i=1..N} min_{j=1..k-1} d̂(x_i, x_{c_j}), where ties are broken arbitrarily 7: Set C_k ← {c_k} as the kth cluster 8: end for 9: Assign the remaining points to the closest centres: 10: for i = 1..N do 11: k ← argmin_{k=1..κ} min_{j∈C_k} d̂(x_i, x_j) 12: C_k ← C_k ∪ {i} 13: end for 14: OUTPUT: clusters C_1, C_2, ..., C_κ. Algorithm 2 (Online clustering) 1: INPUT: number κ of target clusters 2: for t = 1..∞ do 3: Obtain new sequences S(t) = {x^t_1, ..., x^t_{N(t)}} 4: Initialize the normalization factor: η ← 0 5: Initialize the final clusters: C_k(t) ← ∅, k = 1..κ 6: Generate N(t) - κ + 1 candidate cluster centres: 7: for j = κ..N(t) do 8: {C^j_1, ..., C^j_κ} ← Alg1({x^t_1, ..., x^t_j}, κ) 9: µ_k ← min{i ∈ C^j_k}, k = 1..κ (set the smallest index as cluster centre) 10: (c^j_1, ..., c^j_κ) ← sort(µ_1, ..., µ_κ) (sort the cluster centres increasingly) 11: γ_j ← min_{k≠k', k,k'=1..κ} d̂(x^t_{c^j_k}, x^t_{c^j_{k'}}) (calculate performance score) 12: w_j ← 1/(j(j+1)) 13: η ← η + w_j γ_j 14: end for 15: Assign points to clusters: 16: for i = 1..N(t) do 17: k ← argmin_{k=1..κ} (1/η) Σ_{j=1}^{N(t)} w_j γ_j d̂(x^t_i, x^t_{c^j_k}) 18: C_k(t) ← C_k(t) ∪ {i} 19: end for 20: OUTPUT: {C_1(t), ..., C_κ(t)} 21: end for |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | As an application we consider the problem of clustering motion capture sequences, where groups of sequences with similar dynamics are to be identified. Data is taken from the Motion Capture database (MOCAP) which consists of time-series data representing human locomotion. ... MOCAP. CMU graphics lab motion capture database. Available at http://mocap.cs.cmu.edu/. |
| Dataset Splits | No | The paper describes the generation of synthetic data and the nature of the real-world Motion Capture (MOCAP) data, but it does not specify explicit training, testing, or validation splits, nor does it refer to standard predefined splits for reproducibility. It only mentions that motion sequences were "trimmed to equal duration" in comparison works but that their method uses "original sequences as our method is not limited by variation in sequence lengths" without detailing the split methodology for their experiments. |
| Hardware Specification | No | The time taken to complete each task is in the order of few minutes on a standard laptop computer. (This mentions a "standard laptop computer" but lacks specific details such as CPU model, RAM, or GPU, which are necessary for hardware reproducibility.) |
| Software Dependencies | No | The paper does not provide specific details or version numbers for any software libraries, programming languages, or tools used in the implementation or experimentation. |
| Experiment Setup | Yes | 6.1.1 Time Series Generation To generate a sequence x = X_1..n we proceed as follows: Fix some parameter α ∈ (0, 1). Select r_0 ∈ [0, 1]; then, for each i = 1..n obtain r_i by shifting r_{i-1} by α to the right, and removing the integer part, i.e. r_i := r_{i-1} + α - ⌊r_{i-1} + α⌋. The sequence x = (X_1, X_2, ...) is then obtained from r_i by thresholding at 0.5, that is X_i := I{r_i > 0.5}. If α is irrational then x forms a stationary ergodic time series. (We simulate α by a long double with a long mantissa.) For the purpose of our experiments, first we fix κ := 5 different process distributions specified by α_1 = 0.31..., α_2 = 0.33..., α_3 = 0.35..., α_4 = 0.37..., α_5 = 0.39.... The parameters α_i are intentionally selected to be close, in order to make the process distributions harder to distinguish. Next we generate an N × M data matrix X, each row of which is a sequence generated by one of the process distributions. Our task in both the online and the batch setting is to cluster the rows of X into κ = 5 clusters. 6.1.2 Batch Setting In this experiment we demonstrate that in the batch setting, the clustering errors corresponding to both the online and the offline algorithms converge to 0 as the sequence lengths grow. To this end, at every time step t we take an N × n(t) submatrix X\|n(t) of X composed of the rows of X terminated at length n(t), where n(t) = 5t. Then at each iteration we let each of the algorithms (online and offline) cluster the rows of X\|n(t) into five clusters, and calculate the clustering error rate of each algorithm. 6.1.3 Online Setting In this experiment we demonstrate that, unlike the online algorithm, the offline algorithm is consistently confused by the new sequences arriving at each time step in an online setting. To simulate an online setting, we proceed as follows: At every time step t, a triangular window is used to reveal the first 1..n_i(t), i = 1..t elements of the first t rows of the data matrix X, with n_i(t) := 5(t - i) + 1, i = 1..t. 
This gives a total of t sequences, where the ith sequence, i = 1..t, corresponds to the ith row of X terminated at length n_i(t). At every time step t the online and offline algorithms are each used in turn to cluster the observed t sequences into five clusters. Note that the performance of both algorithms is measured on all sequences available at a given time, not on a fixed batch of sequences. |
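The conditional-entropy score used for the MOCAP comparison, S = -Σ_{i,j} (M_ij / Σ_{i',j'} M_{i'j'}) log(M_ij / Σ_j M_ij), can be computed directly from a clustering confusion matrix. The sketch below is a minimal illustration, not the authors' code; the function name and the convention that rows index one labelling and columns the other are our own assumptions.

```python
import numpy as np

def conditional_entropy(M):
    """Conditional entropy of the true labelling given the prediction,
    computed from a clustering confusion matrix M (illustrative sketch)."""
    M = np.asarray(M, dtype=float)
    total = M.sum()                          # Σ_{i',j'} M_{i'j'}
    row_sums = M.sum(axis=1, keepdims=True)  # Σ_j M_{ij}
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (M / total) * np.log(M / row_sums)
    # empty cells contribute 0 (the 0 * log 0 convention); nansum drops them
    return -np.nansum(terms)
```

A perfect clustering gives S = 0, while a completely uninformative one gives the full entropy of the label distribution, so lower is better.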
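The two phases of the offline algorithm quoted above (farthest-point initialization of the κ centres, then assignment of the remaining points) can be sketched as follows, assuming a precomputed N × N matrix of pairwise distance estimates d̂. This is an illustrative reading of the pseudocode, not the authors' implementation; here each remaining point joins the cluster containing its nearest already-clustered point.

```python
import numpy as np

def offline_cluster(dist, kappa):
    """Sketch of the offline clustering scheme.

    dist: (N, N) array of pairwise distance estimates d_hat(x_i, x_j).
    Returns kappa clusters as lists of sequence indices.
    """
    N = dist.shape[0]
    centres = [0]  # c_1 <- the first sequence
    # farthest-point initialization: each new centre maximizes the
    # distance to its nearest existing centre
    for _ in range(1, kappa):
        d_nearest = dist[:, centres].min(axis=1)
        centres.append(int(d_nearest.argmax()))
    clusters = [[c] for c in centres]
    assigned = set(centres)
    # assign each remaining point to the cluster containing its
    # nearest already-clustered point
    for i in range(N):
        if i in assigned:
            continue
        _, k_best = min(
            (dist[i, j], k) for k, C in enumerate(clusters) for j in C
        )
        clusters[k_best].append(i)
        assigned.add(i)
    return clusters
```

Any consistent distance estimate between time-series distributions can be plugged in as `dist`; the sketch deliberately leaves that estimator abstract.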
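The rotation processes described in the experiment setup are simple to reproduce: shift r by α modulo 1 at each step and threshold at 0.5. A minimal sketch follows; the function name, the default r_0, and the use of numpy are our own choices, and ordinary double precision is used rather than the long-double arithmetic the paper mentions.

```python
import numpy as np

def rotation_process(alpha, n, r0=0.0):
    """Generate a length-n binary rotation process:
    r_i = (r_{i-1} + alpha) mod 1,  X_i = I{r_i > 0.5}.
    Irrational alpha yields a stationary ergodic series."""
    xs = np.empty(n, dtype=int)
    r = r0
    for i in range(n):
        r = (r + alpha) % 1.0
        xs[i] = int(r > 0.5)
    return xs
```

With α values as close together as the paper's 0.31...-0.39... range, sequences generated from neighbouring parameters are hard to distinguish, which is precisely what makes the synthetic benchmark challenging.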