Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection

Authors: Lorenzo Perini, Paul-Christian Bürkner, Arto Klami

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the detectors' performance over several alternative methods.
Researcher Affiliation | Academia | (1) DTAI lab & Leuven.AI, Department of Computer Science, KU Leuven, Belgium; (2) Cluster of Excellence SimTech, University of Stuttgart, Germany; (3) Department of Computer Science, University of Helsinki, Finland.
Pseudocode | No | The paper describes the proposed method in detail across multiple subsections (3.1, 3.2, 3.3, 3.4) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and online Supplement are available at: https://github.com/Lorenzo-Perini/GammaGMM
Open Datasets | Yes | We carry out our study on 20 commonly used benchmark datasets and additionally 2 (proprietary) real tasks. The benchmark datasets contain semantically useful anomalies widely used in the literature (Campos et al., 2016).
Dataset Splits | No | In the experiments we assume a transductive setting (Campos et al., 2016; Scott & Blanchard, 2008; Toron et al., 2022), where a dataset D is used both for training and testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or processor types used for running the experiments.
Software Dependencies | No | All these methods are implemented in the Python library PyOD (Zhao et al., 2019b). The threshold estimators are implemented in PyThresh with default hyperparameters. Finally, the DPGMM is implemented in sklearn. No version numbers are provided for these libraries.
Experiment Setup | Yes | Our method introduces two new hyperparameters: p_0 and p_high. We set both of them to 0.01 as the default value because extremely high contamination, as well as no anomalies, are unlikely events. We use 10 anomaly detectors with different inductive biases (Soenen et al., 2021)... We set the means' prior to 0, and the covariance matrices' prior to identities of appropriate dimension.
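
The Research Type row mentions setting the detection threshold from the posterior mean of the contamination factor, and the Dataset Splits row describes a transductive setting in which the same dataset D is scored and labeled. The following is a minimal sketch of that thresholding step only, not the authors' implementation: the function name threshold_from_contamination, the toy scores, and the gamma_samples list (standing in for draws from an estimated contamination distribution) are all hypothetical, and the rule of flagging the top round(mean * N) scores is an assumption made for illustration.

```python
from statistics import fmean


def threshold_from_contamination(scores, gamma_samples):
    """Turn anomaly scores into binary labels using a contamination estimate.

    scores: anomaly scores for the dataset D (higher means more anomalous).
    gamma_samples: hypothetical draws of the contamination factor, each in [0, 1].
    Returns (threshold, labels), where labels[i] is 1 if scores[i] is flagged.
    """
    if not scores or not gamma_samples:
        raise ValueError("scores and gamma_samples must be non-empty")

    # Point estimate of the contamination factor: the mean of the samples.
    gamma_hat = fmean(gamma_samples)

    # Expected number of anomalies among the N scored examples.
    n_anomalies = round(gamma_hat * len(scores))

    if n_anomalies == 0:
        # No anomalies expected: an infinite threshold flags nothing.
        threshold = float("inf")
    else:
        # Threshold at the n-th largest score; tied scores are all flagged.
        threshold = sorted(scores, reverse=True)[n_anomalies - 1]

    labels = [int(s >= threshold) for s in scores]
    return threshold, labels


# Toy transductive example: the same ten scores are used to pick the
# threshold and to receive labels. All numbers are made up.
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.15, 0.05, 0.4, 0.25, 0.35]
gamma_samples = [0.1, 0.2, 0.3]
threshold, labels = threshold_from_contamination(scores, gamma_samples)
```

With these toy inputs the sample mean is about 0.2, so the two highest-scoring examples are flagged. In practice the scores would come from an anomaly detector and the samples from the estimated contamination distribution.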