Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Experimental Comparison and Survey of Twelve Time Series Anomaly Detection Algorithms

Authors: Cynthia Freeman, Jonathan Merriman, Ian Beaver, Abdullah Mueen

JAIR 2021

Each entry below gives a reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Experimental. We provide a comprehensive experimental validation and survey of twelve anomaly detection methods over different time series characteristics to form guidelines based on several metrics: AUC (Area Under the receiver operating characteristic Curve), windowed F-score, and the Numenta Anomaly Benchmark (NAB) scoring model. The paper conducts a thorough experimental comparison of a wide range of anomaly detection methods, evaluates them under all three scoring methods, creates new benchmark datasets for anomaly detection, and compares and contrasts the scoring methods themselves.
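As a rough illustration of the windowed F-score idea referenced above (a detection counts as a true positive if it lands near a labeled anomaly), here is a hedged sketch; the window size, one-to-one matching rule, and tie-breaking are assumptions, not the paper's exact definition:

```python
def windowed_f_score(pred_idx, true_idx, window, beta=1.0):
    """Sketch of a windowed F-score: a predicted anomaly index counts as a
    true positive if it falls within +/- `window` points of an unmatched
    labeled anomaly. Matching is greedy and one-to-one (an assumption)."""
    pred_set, true_set = set(pred_idx), set(true_idx)
    matched = set()  # ground-truth anomalies already claimed by a prediction
    tp = 0
    for p in sorted(pred_set):
        hits = [t for t in true_set if abs(p - t) <= window and t not in matched]
        if hits:
            matched.add(min(hits, key=lambda t: abs(p - t)))  # nearest truth
            tp += 1
    fp = len(pred_set) - tp
    fn = len(true_set) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

With `window=0` this reduces to the ordinary point-wise F-score; widening the window credits detections that are close to, but not exactly on, a labeled anomaly.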
Researcher Affiliation: Collaboration. Cynthia Freeman (EMAIL), Jonathan Merriman (EMAIL), Ian Beaver (EMAIL), Verint Intelligent Self-Service, 12809 Mirabeau Pkwy, Spokane Valley, WA 99216; Abdullah Mueen (EMAIL), University of New Mexico, Computer Science Department, 1901 Redondo S Dr, Albuquerque, NM 87106.
Pseudocode: No. The paper describes the methodologies in detail (e.g., for Half-Space Trees it lists steps such as 'Create the workspace' and 'Initialize the tree') but does not present them in a formally structured pseudocode or algorithm block with a clear label.
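The Half-Space Trees steps mentioned above ('Create the workspace', 'Initialize the tree', ...) can be loosely sketched as follows. This is an illustration, not the paper's implementation: real HS-Trees build an ensemble and score each point against mass counts recorded in a previous reference window, whereas this sketch folds building, updating, and scoring into a single tree for brevity:

```python
import random

def build_tree(mins, maxs, depth, max_depth):
    """Create the workspace and initialize one half-space tree: each internal
    node bisects a randomly chosen dimension of the bounding box."""
    if depth == max_depth:
        return {"mass": 0}  # leaf: only a mass counter
    d = random.randrange(len(mins))
    split = (mins[d] + maxs[d]) / 2
    left_maxs = list(maxs); left_maxs[d] = split
    right_mins = list(mins); right_mins[d] = split
    return {"mass": 0, "dim": d, "split": split,
            "left": build_tree(mins, left_maxs, depth + 1, max_depth),
            "right": build_tree(right_mins, maxs, depth + 1, max_depth)}

def update(node, x, depth=0):
    """Route a point down the tree, incrementing mass along its path, and
    return a depth-weighted mass at the leaf. In HS-Trees low mass regions
    correspond to anomalies, so low scores would flag anomalous points."""
    node["mass"] += 1
    if "dim" not in node:  # reached a leaf
        return node["mass"] * (2 ** depth)
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return update(child, x, depth + 1)
```

Repeated points accumulate mass along their path, so regions that see frequent data earn high (normal-looking) scores while sparsely visited regions stay low.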
Open Source Code: Yes. Methods were either re-implemented or tested via existing libraries for the 12 anomaly detection methods. See https://github.com/dn3kmc/jair_anomaly_detection for all source code implementations, Jupyter notebooks demonstrating how to determine characteristics, and datasets.
Open Datasets: Yes. Some datasets come from the Numenta Anomaly Benchmark repository (Numenta, 2018b), which consists of 58 pre-annotated datasets across a wide variety of domains, along with scripts for evaluating online anomaly detection algorithms. The NAB repository also contains code for combining labels from multiple annotators to obtain ground truth. See https://github.com/dn3kmc/jair_anomaly_detection for all source code implementations, Jupyter notebooks demonstrating how to determine characteristics, and datasets.
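Combining labels from multiple annotators, as mentioned above, can be sketched with a simple agreement threshold. This is a hypothetical simplification: NAB's actual procedure also merges agreed-upon labels into anomaly windows, which this sketch omits:

```python
from collections import Counter

def combine_labels(annotations, min_agreement=0.5):
    """Hypothetical sketch: keep an index as a ground-truth anomaly if at
    least `min_agreement` fraction of annotators marked it.
    `annotations` is a list of per-annotator lists of anomaly indices."""
    counts = Counter(i for ann in annotations for i in set(ann))
    n = len(annotations)
    return sorted(i for i, c in counts.items() if c / n >= min_agreement)
```

For example, an index marked by two of three annotators survives a 0.5 threshold, while one marked by a single annotator does not.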
Dataset Splits: Yes. For every annotated dataset, there is a probationary period (the first 15% of the dataset) during which models are allowed to learn normal patterns of behavior. For this reason, no anomalies are labeled in the probationary period.
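The probationary split described above is straightforward to express. A minimal sketch, assuming a plain Python sequence and the 15% figure quoted from the paper (rounding at the cut point is an assumption):

```python
def probationary_split(series, fraction=0.15):
    """Split a series into the probationary period (first `fraction` of
    points, used only to learn normal behavior and never scored) and the
    remainder, where detections are evaluated."""
    cut = int(len(series) * fraction)
    return series[:cut], series[cut:]
```

For a 100-point series this yields a 15-point probationary prefix and an 85-point scored suffix.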
Hardware Specification: No. The paper describes the algorithms and their performance but does not specify any details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies: No. The paper mentions several software tools and libraries, such as Pyramid (Smith, 2018) in Python, auto.arima in R, the R forecast library (Hyndman & Khandakar, 2008), the stlplus package in R (Hafen, 2016), Anomaly Detection (Twitter, 2015), and Donut (Xu, 2018). However, it does not provide specific version numbers for these, or for the programming languages used (e.g., Python 3.x, R 4.x).
Experiment Setup: Yes. For anomaly detection methods that involve some form of forecasting, grid search is performed on the parameters to minimize the forecasting error. For Facebook Prophet: "We use linear for the growth parameter... For the remaining parameters (changepoint and seasonality prior scales), we use grid search to minimize the mean squared error between the forecast (predictions) and the actual time series values." For VAE (Donut): "The number of latent dimensions is K = 5, the MCMC iteration count is 10, 1024 is the sampling number of Monte Carlo integration, 256 is the batch size, 250 epochs are used, and the optimizer is Adam. As for the structure of the neural network, there are 2 ReLU layers with 100 units, and .01 is the injection ratio. The learning rate is 10^-3 and is discounted by .75 every 10 epochs. L2 regularization is used on the hidden layers with a coefficient of 10^-3." For GLiM: "The exponential forgetting factor, λ, and the step size parameter, η, are chosen via grid search by minimizing the mean squared error between the forecast (predictions) and the actual time series values."
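The grid-search setup quoted above (choose the parameter combination minimizing MSE between forecast and actual values) can be sketched generically. `fit_forecast` is a hypothetical helper standing in for whichever forecaster is being tuned (Prophet, GLiM, etc.); it is assumed to return a forecast aligned with the input series:

```python
import itertools

def grid_search(fit_forecast, series, param_grid):
    """Exhaustively try every combination in `param_grid` (a dict mapping
    parameter name -> list of candidate values) and return the combination
    whose forecast has the lowest mean squared error against `series`."""
    best_params, best_mse = None, float("inf")
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        forecast = fit_forecast(series, **params)
        mse = sum((f - y) ** 2 for f, y in zip(forecast, series)) / len(series)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```

For Prophet this grid would range over the changepoint and seasonality prior scales; for GLiM, over the forgetting factor λ and step size η, as the paper describes.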