Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection
Authors: Xiaoyi Gu, Leman Akoglu, Alessandro Rinaldo
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first show through extensive simulations that NN methods compare favorably to some of the other state-of-the-art algorithms for anomaly detection based on a set of benchmark synthetic datasets. We further consider the performance of NN methods on real datasets, and relate it to the dimensionality of the problem. Next, we analyze the theoretical properties of NN-methods for anomaly detection by studying a more general quantity called distance-to-measure (DTM), originally developed in the literature on robust geometric and topological inference. |
| Researcher Affiliation | Academia | (1) Department of Statistics and Data Science, Carnegie Mellon University; (2) Heinz College of Information Systems and Public Policy, Carnegie Mellon University. {xgu1,lakoglu}@andrew.cmu.edu, arinaldo@cmu.edu |
| Pseudocode | No | The paper describes the methods verbally and mathematically but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for all our experiments are publicly available. (Footnote 2: https://github.com/xgu1/DTM) |
| Open Datasets | Yes | Next, we compare the performance of IForest, LODA, LOF, DTM2, kNN and kth-NN on 23 real datasets from the ODDS library [25]. We consider six high dimensional real datasets from the UCI library [26] (see [12] for details). [25] Shebuti Rayana. ODDS library. http://odds.cs.stonybrook.edu, 2016. [26] A. Frank and A. Asuncion. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010. |
| Dataset Splits | No | The paper mentions evaluating methods on benchmark datasets and real datasets, but it does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or a detailed splitting methodology). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions various algorithms and methods but does not specify any software dependencies (e.g., libraries, frameworks, or programming languages) with version numbers. |
| Experiment Setup | Yes | For all our experiments, we set the following hyperparameters for our models: sub-sampling size = 256 and the number of trees = 100 for IForest; k = 0.03 × (sample size) for all distance-based methods for comparable results; for LODA, we use 100 projections with each projection using approximately √d features. |
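
The Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code (see the linked repository for that). It assumes that k is read as 3% of the sample size and rounded up, and that the three distance-based scores follow their usual definitions: kth-NN is the distance to the k-th nearest neighbor, kNN is the mean distance to the k nearest neighbors, and DTM2 is the root mean squared distance to the k nearest neighbors. The names `SETUP`, `neighbors_k`, and `nn_scores` are hypothetical.

```python
import math

# Hyperparameter values quoted in the Experiment Setup row.
# The dictionary layout and key names are hypothetical.
SETUP = {
    "iforest": {"subsample_size": 256, "n_trees": 100},
    "loda": {"n_projections": 100},
    "knn_fraction": 0.03,
}


def neighbors_k(n_samples, fraction=SETUP["knn_fraction"]):
    """Turn the fraction into a neighbor count (rounding up is an assumption)."""
    return max(1, math.ceil(fraction * n_samples))


def nn_scores(points, k):
    """Return (kth_nn, knn, dtm2) score lists, one entry per point.

    Assumed definitions, not stated in the table above:
      kth_nn = distance to the k-th nearest neighbor
      knn    = mean distance to the k nearest neighbors
      dtm2   = root mean squared distance to the k nearest neighbors
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    kth_nn, knn, dtm2 = [], [], []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        kth_nn.append(nearest[-1])
        knn.append(sum(nearest) / k)
        dtm2.append(math.sqrt(sum(d * d for d in nearest) / k))
    return kth_nn, knn, dtm2


# Toy data (made up for illustration): a tight cluster plus one distant point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
kth, avg, dtm = nn_scores(data, k=2)
for point, scores in zip(data, zip(kth, avg, dtm)):
    print(point, [round(s, 3) for s in scores])
```

Higher scores indicate more anomalous points under all three definitions. The brute-force search is quadratic in the number of points, so a tree-based or approximate neighbor index would be a more practical choice on the larger benchmark datasets.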