Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Experimental Comparison and Survey of Twelve Time Series Anomaly Detection Algorithms

Authors: Cynthia Freeman, Jonathan Merriman, Ian Beaver, Abdullah Mueen

JAIR 2021

Each entry below gives a reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Experimental. We provide a comprehensive experimental validation and survey of twelve anomaly detection methods over different time series characteristics to form guidelines based on several metrics: AUC (Area Under the receiver operating characteristic Curve), windowed F-score, and the Numenta Anomaly Benchmark (NAB) scoring model. The paper conducts a thorough experimental comparison of a wide range of anomaly detection methods, evaluates them under all three scoring methods, creates new benchmark datasets for anomaly detection, and compares and contrasts the scoring methods themselves.
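As a rough illustration of the windowed F-score idea referenced above (a detection counts as a true positive if it lands near a labeled anomaly), here is a hedged sketch; the window size, one-to-one matching rule, and tie-breaking are assumptions, not the paper's exact definition:

```python
def windowed_f_score(pred_idx, true_idx, window, beta=1.0):
    """Sketch of a windowed F-score: a predicted anomaly index counts as a
    true positive if it falls within +/- `window` points of an unmatched
    labeled anomaly. Matching is greedy and one-to-one (an assumption)."""
    pred_set, true_set = set(pred_idx), set(true_idx)
    matched = set()  # ground-truth anomalies already claimed by a prediction
    tp = 0
    for p in sorted(pred_set):
        hits = [t for t in true_set if abs(p - t) <= window and t not in matched]
        if hits:
            matched.add(min(hits, key=lambda t: abs(p - t)))  # nearest truth
            tp += 1
    fp = len(pred_set) - tp
    fn = len(true_set) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

With `window=0` this reduces to the ordinary point-wise F-score; widening the window credits detections that are close to, but not exactly on, a labeled anomaly.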
Researcher Affiliation: Collaboration. Cynthia Freeman (EMAIL), Jonathan Merriman (EMAIL), Ian Beaver (EMAIL), Verint Intelligent Self-Service, 12809 Mirabeau Pkwy, Spokane Valley, WA 99216; Abdullah Mueen (EMAIL), University of New Mexico, Computer Science Department, 1901 Redondo S Dr, Albuquerque, NM 87106.
Pseudocode: No. The paper describes the methodologies in detail (e.g., for Half-Space Trees it lists steps such as 'Create the workspace' and 'Initialize the tree') but does not present them in a formally structured pseudocode or algorithm block with a clear label.
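The Half-Space Trees steps mentioned above ('Create the workspace', 'Initialize the tree', ...) can be loosely sketched as follows. This is an illustration, not the paper's implementation: real HS-Trees build an ensemble and score each point against mass counts recorded in a previous reference window, whereas this sketch folds building, updating, and scoring into a single tree for brevity:

```python
import random

def build_tree(mins, maxs, depth, max_depth):
    """Create the workspace and initialize one half-space tree: each internal
    node bisects a randomly chosen dimension of the bounding box."""
    if depth == max_depth:
        return {"mass": 0}  # leaf: only a mass counter
    d = random.randrange(len(mins))
    split = (mins[d] + maxs[d]) / 2
    left_maxs = list(maxs); left_maxs[d] = split
    right_mins = list(mins); right_mins[d] = split
    return {"mass": 0, "dim": d, "split": split,
            "left": build_tree(mins, left_maxs, depth + 1, max_depth),
            "right": build_tree(right_mins, maxs, depth + 1, max_depth)}

def update(node, x, depth=0):
    """Route a point down the tree, incrementing mass along its path, and
    return a depth-weighted mass at the leaf. In HS-Trees low mass regions
    correspond to anomalies, so low scores would flag anomalous points."""
    node["mass"] += 1
    if "dim" not in node:  # reached a leaf
        return node["mass"] * (2 ** depth)
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return update(child, x, depth + 1)
```

Repeated points accumulate mass along their path, so regions that see frequent data earn high (normal-looking) scores while sparsely visited regions stay low.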
Open Source Code: Yes. Methods were either re-implemented or tested via existing libraries for the 12 anomaly detection methods. See https://github.com/dn3kmc/jair_anomaly_detection for all source code implementations, Jupyter notebooks demonstrating how to determine characteristics, and datasets.
Open Datasets: Yes. Some datasets come from the Numenta Anomaly Benchmark repository (Numenta, 2018b), which consists of 58 pre-annotated datasets across a wide variety of domains, along with scripts for evaluating online anomaly detection algorithms. The NAB repository also contains code for combining labels from multiple annotators to obtain ground truth. See https://github.com/dn3kmc/jair_anomaly_detection for all source code implementations, Jupyter notebooks demonstrating how to determine characteristics, and datasets.
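Combining labels from multiple annotators, as mentioned above, can be sketched with a simple agreement threshold. This is a hypothetical simplification: NAB's actual procedure also merges agreed-upon labels into anomaly windows, which this sketch omits:

```python
from collections import Counter

def combine_labels(annotations, min_agreement=0.5):
    """Hypothetical sketch: keep an index as a ground-truth anomaly if at
    least `min_agreement` fraction of annotators marked it.
    `annotations` is a list of per-annotator lists of anomaly indices."""
    counts = Counter(i for ann in annotations for i in set(ann))
    n = len(annotations)
    return sorted(i for i, c in counts.items() if c / n >= min_agreement)
```

For example, an index marked by two of three annotators survives a 0.5 threshold, while one marked by a single annotator does not.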
Dataset Splits: Yes. For every annotated dataset, there is a probationary period (the first 15% of the dataset) during which models are allowed to learn normal patterns of behavior. For this reason, no anomalies are labeled in the probationary period.
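The probationary split described above is straightforward to express. A minimal sketch, assuming a plain Python sequence and the 15% figure quoted from the paper (rounding at the cut point is an assumption):

```python
def probationary_split(series, fraction=0.15):
    """Split a series into the probationary period (first `fraction` of
    points, used only to learn normal behavior and never scored) and the
    remainder, where detections are evaluated."""
    cut = int(len(series) * fraction)
    return series[:cut], series[cut:]
```

For a 100-point series this yields a 15-point probationary prefix and an 85-point scored suffix.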
Hardware Specification: No. The paper describes the algorithms and their performance but does not specify any details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies: No. The paper mentions several software tools and libraries, such as Pyramid (Smith, 2018) in Python, auto.arima in R, the R forecast library (Hyndman & Khandakar, 2008), the stlplus package in R (Hafen, 2016), Anomaly Detection (Twitter, 2015), and Donut (Xu, 2018). However, it does not provide specific version numbers for these, or for the programming languages used (e.g., Python 3.x, R 4.x).
Experiment Setup: Yes. For anomaly detection methods that involve some form of forecasting, grid search is performed on the parameters to minimize the forecasting error. For Facebook Prophet: "We use linear for the growth parameter... For the remaining parameters (changepoint and seasonality prior scales), we use grid search to minimize the mean squared error between the forecast (predictions) and the actual time series values." For VAE (Donut): "The number of latent dimensions is K = 5, the MCMC iteration count is 10, 1024 is the sampling number of Monte Carlo integration, 256 is the batch size, 250 epochs are used, and the optimizer is Adam. As for the structure of the neural network, there are 2 ReLU layers with 100 units, and .01 is the injection ratio. The learning rate is 10^-3 and is discounted by .75 every 10 epochs. L2 regularization is used on the hidden layers with a coefficient of 10^-3." For GLiM: "The exponential forgetting factor, λ, and the step size parameter, η, are chosen via grid search by minimizing the mean squared error between the forecast (predictions) and the actual time series values."
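The grid-search setup quoted above (choose the parameter combination minimizing MSE between forecast and actual values) can be sketched generically. `fit_forecast` is a hypothetical helper standing in for whichever forecaster is being tuned (Prophet, GLiM, etc.); it is assumed to return a forecast aligned with the input series:

```python
import itertools

def grid_search(fit_forecast, series, param_grid):
    """Exhaustively try every combination in `param_grid` (a dict mapping
    parameter name -> list of candidate values) and return the combination
    whose forecast has the lowest mean squared error against `series`."""
    best_params, best_mse = None, float("inf")
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        forecast = fit_forecast(series, **params)
        mse = sum((f - y) ** 2 for f, y in zip(forecast, series)) / len(series)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```

For Prophet this grid would range over the changepoint and seasonality prior scales; for GLiM, over the forgetting factor λ and step size η, as the paper describes.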