Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection

Authors: Xiaoyi Gu, Leman Akoglu, Alessandro Rinaldo

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first show through extensive simulations that NN methods compare favorably to some of the other state-of-the-art algorithms for anomaly detection based on a set of benchmark synthetic datasets. We further consider the performance of NN methods on real datasets, and relate it to the dimensionality of the problem. Next, we analyze the theoretical properties of NN methods for anomaly detection by studying a more general quantity called distance-to-measure (DTM), originally developed in the literature on robust geometric and topological inference.
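The NN scores discussed above are simple to state: the kth-NN score of a query point is the distance to its kth nearest neighbor, and the DTM with power p = 2 (the DTM2 variant evaluated in the paper) is the root mean of the k smallest squared distances. The following NumPy sketch illustrates both quantities; the function and variable names are ours, not from the authors' released code:

```python
import numpy as np

def knn_scores(train, query, k):
    """Return two NN-based anomaly scores for each query point:
    the kth-NN distance, and the DTM with p = 2 (root mean of the
    k smallest squared distances to the training set)."""
    # pairwise Euclidean distances: shape (n_query, n_train)
    d = np.linalg.norm(query[:, None, :] - train[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :k]            # k smallest distances per query
    kth_nn = knn[:, -1]                        # distance to the kth neighbor
    dtm2 = np.sqrt((knn ** 2).mean(axis=1))    # DTM score with p = 2
    return kth_nn, dtm2
```

Both scores are monotone in "how far the query sits from the data"; the DTM averages over the k nearest distances, which makes it less sensitive to a single unusually close neighbor than the plain kth-NN distance.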
Researcher Affiliation | Academia | (1) Department of Statistics and Data Science, Carnegie Mellon University; (2) Heinz College of Information Systems and Public Policy, Carnegie Mellon University. {xgu1,lakoglu}@andrew.cmu.edu, arinaldo@cmu.edu
Pseudocode | No | The paper describes the methods verbally and mathematically but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for all our experiments are publicly available. (Footnote 2: https://github.com/xgu1/DTM)
Open Datasets | Yes | Next, we compare the performance of IForest, LODA, LOF, DTM2, kNN and kth-NN on 23 real datasets from the ODDS library [25]. We consider six high-dimensional real datasets from the UCI library [26] (see [12] for details). [25] Shebuti Rayana. ODDS library. http://odds.cs.stonybrook.edu, 2016. [26] A. Frank and A. Asuncion. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2010.
Dataset Splits | No | The paper mentions evaluating methods on benchmark datasets and real datasets, but it does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or a detailed splitting methodology).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions various algorithms and methods but does not specify any software dependencies (e.g., libraries, frameworks, or programming languages) with version numbers.
Experiment Setup | Yes | For all our experiments, we set the following hyperparameters for our models: sub-sampling size = 256 and the number of trees = 100 for IForest; k = 0.03 × (sample size) for all distance-based methods for comparable results; for LODA, we use 100 projections, with each projection using approximately √d features.
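As a rough illustration of the LODA configuration quoted above (100 sparse one-dimensional random projections, each using roughly √d of the d features, scored by a histogram density estimate), here is a hedged NumPy sketch. The histogram binning, the smoothing constant, and all names are our own simplification, not the authors' implementation:

```python
import numpy as np

def loda_scores(X, n_projections=100, n_bins=10, rng=None):
    """LODA-style anomaly scores: average negative log-density of each
    point over sparse 1-D random projections, where each projection
    uses ~sqrt(d) of the d features (higher score = more anomalous)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    k = max(1, int(round(np.sqrt(d))))       # ~sqrt(d) nonzero features
    neg_log_dens = np.zeros(n)
    for _ in range(n_projections):
        w = np.zeros(d)
        idx = rng.choice(d, size=k, replace=False)
        w[idx] = rng.standard_normal(k)      # sparse random direction
        z = X @ w                            # 1-D projection of the data
        hist, edges = np.histogram(z, bins=n_bins, density=True)
        # map each projected value to its histogram bin
        b = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_bins - 1)
        neg_log_dens += -np.log(hist[b] + 1e-12)
    return neg_log_dens / n_projections
```

Points that repeatedly land in low-density histogram bins across many projections accumulate a large average negative log-density, which is what flags them as anomalous.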