Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Experimental Comparison and Survey of Twelve Time Series Anomaly Detection Algorithms
Authors: Cynthia Freeman, Jonathan Merriman, Ian Beaver, Abdullah Mueen
JAIR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a comprehensive experimental validation and survey of twelve anomaly detection methods over different time series characteristics to form guidelines based on several metrics: the AUC (Area Under the Curve), windowed F-score, and Numenta Anomaly Benchmark (NAB) scoring model. We make these analyses by conducting a thorough experimental comparison of a wide range of anomaly detection methods and evaluate them using windowed F-scores, AUC (Area Under the receiver operating characteristic Curve), and NAB (Numenta Anomaly Benchmark) scores. Created new benchmark datasets for anomaly detection. Compares and contrasts multiple scoring methods: windowed F-scoring, AUC, and Numenta Anomaly Benchmark scoring. |
| Researcher Affiliation | Collaboration | Cynthia Freeman EMAIL Jonathan Merriman EMAIL Ian Beaver EMAIL Verint Intelligent Self-Service 12809 Mirabeau Pkwy, Spokane Valley, WA 99216 Abdullah Mueen EMAIL University of New Mexico Computer Science Department 1901 Redondo S Dr, Albuquerque, NM 87106 |
| Pseudocode | No | The paper describes the methodologies in detail (e.g., for Half-Space Trees, it lists steps like 'Create the workspace', 'Initialize the tree', etc.) but does not present these in a formally structured pseudocode or algorithm block with a clear label. |
| Open Source Code | Yes | Either re-implemented or used existing libraries to test 12 different anomaly detection methods. See https://github.com/dn3kmc/jair_anomaly_detection for all source code implementations, Jupyter notebooks demonstrating how to determine characteristics, and datasets. |
| Open Datasets | Yes | Some datasets come from the Numenta Anomaly Benchmark repository (Numenta, 2018b), which consists of 58 pre-annotated datasets across a wide variety of domains and scripts for evaluating online anomaly detection algorithms. The Numenta Anomaly Benchmark repository also contains code for combining labels from multiple annotators to obtain ground truth. See https://github.com/dn3kmc/jair_anomaly_detection for all source code implementations, Jupyter notebooks demonstrating how to determine characteristics, and datasets. |
| Dataset Splits | Yes | For every such annotated dataset, there is a probationary period (first 15% of the dataset) where models are allowed to learn normal patterns of behavior. For this reason, no anomalies are labeled in the probationary period. |
| Hardware Specification | No | The paper describes the algorithms and their performance but does not specify any details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several software tools and libraries such as "Pyramid (Smith, 2018) in Python", "auto.arima in R", "R forecast library (Hyndman & Khandakar, 2008)", "stlplus package in R (Hafen, 2016)", "Anomaly Detection (Twitter, 2015)", and "Donut (Xu, 2018)". However, it does not provide specific version numbers for these libraries or for the programming languages used (e.g., Python 3.x, R 4.x). |
| Experiment Setup | Yes | For anomaly detection methods that involve some form of forecasting, we perform grid search on the parameters to minimize the forecasting error. For Facebook Prophet, "We use linear for the growth parameter... For the remaining parameters (changepoint and seasonality prior scales), we use grid search to minimize the mean squared error between the forecast (predictions) and the actual time series values." For VAE (Donut), "The number of latent dimensions is K = 5, the MCMC iteration count is 10, 1024 is the sampling number of Monte Carlo integration, 256 is the batch size, 250 epochs are used, and the optimizer is Adam. As for the structure of the neural network, there are 2 ReLU layers with 100 units, and .01 is the injection ratio. The learning rate is 10^-3 and is discounted by .75 every 10 epochs. L2 regularization is used on the hidden layers with a coefficient of 10^-3." For GLiM, "The exponential forgetting factor, λ, and the step size parameter, η, are chosen via grid search by minimizing the mean squared error between the forecast (predictions) and the actual time series values." |
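The Dataset Splits row notes that models learn normal behavior during a probationary period covering the first 15% of each dataset, with no anomalies labeled there. A minimal sketch of that split (the function name and rounding choice are assumptions, not taken from the paper's code):

```python
import math

def split_probationary(series, frac=0.15):
    """Split a series into a probationary period (first `frac` of points,
    used only to learn normal behavior) and the scored remainder."""
    cutoff = math.ceil(len(series) * frac)
    return series[:cutoff], series[cutoff:]

# Example: a 100-point series yields a 15-point probationary period.
probation, scored = split_probationary(list(range(100)))
```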
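The Experiment Setup row repeatedly describes choosing parameters by grid search to minimize the mean squared error between the forecast and the actual series (for Prophet's prior scales and GLiM's λ and η). A generic sketch of that procedure, assuming a hypothetical `forecast_fn(series, **params)` interface that returns one prediction per observation:

```python
import itertools

def grid_search_mse(forecast_fn, series, param_grid):
    """Return the parameter combination minimizing MSE between the
    forecast and the observed series, along with that MSE."""
    best_params, best_mse = None, float("inf")
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        preds = forecast_fn(series, **params)
        mse = sum((p - y) ** 2 for p, y in zip(preds, series)) / len(series)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```

For a real method, `forecast_fn` would wrap the model's fit/predict cycle (e.g., Prophet with a given changepoint prior scale); here it is only a placeholder interface.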
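The Research Type row lists the windowed F-score among the paper's evaluation metrics. The idea is that a detection close enough to a labeled anomaly counts as a true positive; this sketch uses a simple symmetric window and one match per label, which may differ from the paper's exact windowing rules:

```python
def windowed_f_score(true_idx, pred_idx, window):
    """F-score where a prediction within `window` steps of a labeled
    anomaly is a true positive; each label is matched at most once."""
    unmatched = set(true_idx)
    tp = 0
    for p in sorted(pred_idx):
        hit = next((t for t in sorted(unmatched) if abs(p - t) <= window), None)
        if hit is not None:
            unmatched.discard(hit)
            tp += 1
    fp = len(pred_idx) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```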