Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Uncertainty-aware Evaluation of Auxiliary Anomalies with the Expected Anomaly Posterior
Authors: Lorenzo Perini, Maja Rudolph, Sabrina Schmedding, Chen Qiu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally on 40 benchmark datasets of images and tabular data, we show that EAP outperforms 12 adapted data quality estimators in the majority of cases. |
| Researcher Affiliation | Collaboration | Lorenzo Perini (Bosch Center for AI, Germany; DTAI lab & Leuven.AI, KU Leuven, Belgium); Maja Rudolph (Bosch Center for AI, USA; University of Wisconsin-Madison, USA); Sabrina Schmedding (Bosch Center for AI, Germany); Chen Qiu (Bosch Center for AI, USA) |
| Pseudocode | No | The paper describes the methodology in Section 3, including equations and theoretical analysis, but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor a structured procedure formatted like code. |
| Open Source Code | Yes | Code of EAP is available at: https://github.com/Lorenzo-Perini/Expected_Anomaly_Posterior. |
| Open Datasets | Yes | Data. We carry out our study on 40 datasets, including 15 widely used benchmark image datasets (MVTec) (Bergmann et al., 2019), 3 industrial image datasets for Surface Defect Inspection (SDI) (Wang et al., 2022), and an additional 22 benchmark tabular datasets for anomaly detection with semantically useful anomalies, commonly referenced in the literature (Han et al., 2022b). |
| Dataset Splits | Yes | Setup. For each dataset, we proceed as follows: (i) We create a balanced test set by adding random normal examples and 50% of available anomalies; (ii) We generate a set of l auxiliary anomalies as described above, with l = 40% of available anomalies; (iii) We create a training set by adding 10% of available anomalies and all remaining normal examples. |
| Hardware Specification | No | To run all experiments, we use an internal cluster of six 24- or 32-thread machines (128 GB of memory). |
| Software Dependencies | No | The paper mentions using 'SSDO (Vercruyssen et al., 2018)' and 'Isolation Forest (Liu et al., 2008)' as the underlying anomaly detector and prior, respectively, and the pre-trained 'ViT-B-16-SigLIP (Zhai et al., 2023)' for feature extraction. However, it does not provide specific version numbers for these software components or for any other libraries used. |
| Experiment Setup | Yes | For all baselines, we use SSDO (Vercruyssen et al., 2018) as the underlying anomaly detector f with k = 10 and Isolation Forest (Liu et al., 2008) as prior. This choice is motivated as follows. First, such a combination has been analyzed and used often by researchers (Drogkoula et al., 2023; Stradiotti et al., 2024a; Serban et al., 2024; Pang et al., 2023; Stradiotti et al., 2024b). Second, because some of our datasets include tabular data, we need to employ a fast yet accurate detector for such a data modality. Recent papers such as (Stradiotti et al., 2024b) highlight that SSDO + Isolation Forest is one of the best-performing detectors. When exposed to selected auxiliary anomalies, we employ an SVM with RBF kernel (for images) and a Random Forest (for tabular data) to make the normal vs. abnormal classification. For images, we use the pre-trained ViT-B-16-SigLIP (Zhai et al., 2023) to extract the features from images and use them as inputs to EAP and all baselines. Our method EAP has one hyperparameter, namely the prior (α0, β0), which we set to α0 = m/n (the proportion of anomalies in the training set) and β0 = 1 − m/n. Intuitively, this corresponds to the expected proportion of (real) anomalies if an external dataset were sampled from P(X, Y). The baselines have the following hyperparameters: kNNShap and Rarity have k = 10; Data Banzhaf, AME, Inf, and Data-OOB use 50 models. All other hyperparameters are set as default (Soenen et al., 2021). |
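The split protocol (50% of anomalies to a balanced test set, 40% to the auxiliary pool, the remaining 10% to training) and EAP's single hyperparameter, the Beta prior set from the training anomaly proportion, can be sketched as below. This is a minimal illustration under the quoted description; the helper names are hypothetical, not the authors' released code.

```python
import numpy as np

def split_dataset(X_normal, X_anomaly, seed=0):
    """Hypothetical sketch of the quoted split protocol:
    (i) balanced test set with 50% of anomalies and as many normals,
    (ii) auxiliary pool with 40% of anomalies,
    (iii) training set with the remaining ~10% of anomalies
         plus all remaining normal examples."""
    rng = np.random.default_rng(seed)
    n_anom = len(X_anomaly)
    idx = rng.permutation(n_anom)
    n_test = int(0.5 * n_anom)   # (i) 50% of available anomalies
    n_aux = int(0.4 * n_anom)    # (ii) l = 40% of available anomalies
    test_anom = X_anomaly[idx[:n_test]]
    aux_anom = X_anomaly[idx[n_test:n_test + n_aux]]
    train_anom = X_anomaly[idx[n_test + n_aux:]]          # (iii) ~10%
    norm_idx = rng.permutation(len(X_normal))
    test_norm = X_normal[norm_idx[:n_test]]               # balance the test set
    train_norm = X_normal[norm_idx[n_test:]]              # all remaining normals
    return (train_norm, train_anom), aux_anom, (test_norm, test_anom)

def beta_prior(m, n):
    """EAP's prior as quoted: alpha0 = m/n (anomaly proportion in
    the training set), beta0 = 1 - m/n."""
    return m / n, 1.0 - m / n
```

For example, with 100 normals and 20 anomalies, the test set receives 10 anomalies and 10 normals, the auxiliary pool 8 anomalies, and training keeps 2 anomalies plus 90 normals, giving a prior of roughly (0.022, 0.978).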