Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Uncertainty-aware Evaluation of Auxiliary Anomalies with the Expected Anomaly Posterior
Authors: Lorenzo Perini, Maja Rudolph, Sabrina Schmedding, Chen Qiu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally on 40 benchmark datasets of images and tabular data, we show that EAP outperforms 12 adapted data quality estimators in the majority of cases. |
| Researcher Affiliation | Collaboration | Lorenzo Perini (Bosch Center for AI, Germany; DTAI lab & Leuven.AI, KU Leuven, Belgium); Maja Rudolph (Bosch Center for AI, USA; University of Wisconsin-Madison, USA); Sabrina Schmedding (Bosch Center for AI, Germany); Chen Qiu (Bosch Center for AI, USA) |
| Pseudocode | No | The paper describes the methodology in Section 3, including equations and theoretical analysis, but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor a structured procedure formatted like code. |
| Open Source Code | Yes | Code of EAP is available at: https://github.com/Lorenzo-Perini/Expected_Anomaly_Posterior. |
| Open Datasets | Yes | Data. We carry out our study on 40 datasets, including 15 widely used benchmark image datasets (MVTec) (Bergmann et al., 2019), 3 industrial image datasets for Surface Defect Inspection (SDI) (Wang et al., 2022), and an additional 22 benchmark tabular datasets for anomaly detection with semantically useful anomalies, commonly referenced in the literature (Han et al., 2022b). |
| Dataset Splits | Yes | Setup. For each dataset, we proceed as follows: (i) We create a balanced test set by adding random normal examples and 50% of available anomalies; (ii) We generate a set of l auxiliary anomalies as described above, with l = 40% of available anomalies; (iii) We create a training set by adding 10% of available anomalies and all remaining normal examples. |
| Hardware Specification | No | To run all experiments, we use an internal cluster of six 24- or 32-thread machines (128 GB of memory). |
| Software Dependencies | No | The paper mentions using 'SSDO (Vercruyssen et al., 2018)' and 'Isolation Forest (Liu et al., 2008)' as the underlying anomaly detector and prior, respectively, and the pre-trained 'ViT-B-16-SigLIP (Zhai et al., 2023)' for feature extraction. However, it does not provide specific version numbers for these software components or for any other libraries used. |
| Experiment Setup | Yes | For all baselines, we use SSDO (Vercruyssen et al., 2018) as the underlying anomaly detector f with k = 10 and Isolation Forest (Liu et al., 2008) as prior. This choice is motivated as follows. First, such a combination has been analyzed and used often by researchers (Drogkoula et al., 2023; Stradiotti et al., 2024a; Serban et al., 2024; Pang et al., 2023; Stradiotti et al., 2024b). Second, because some of our datasets include tabular data, we need to employ a fast yet accurate detector for such a data modality. Recent papers such as (Stradiotti et al., 2024b) highlight that SSDO + Isolation Forest is one of the best-performing detectors. When exposed to selected auxiliary anomalies, we employ an SVM with RBF kernel (for images) and a Random Forest (for tabular data) to make the normal vs. abnormal classification. For images, we use the pre-trained ViT-B-16-SigLIP (Zhai et al., 2023) to extract the features from images and use them as inputs to EAP and all baselines. Our method EAP has one hyperparameter, namely the prior (α0, β0), which we set to α0 = m/n (the proportion of anomalies in the training set) and β0 = 1 − m/n. Intuitively, this corresponds to the expected proportion of (real) anomalies if an external dataset were sampled from P(X, Y). The baselines have the following hyperparameters: kNNShap and Rarity have k = 10; Data Banzhaf, AME, Inf, and Data-OOB use 50 models. All other hyperparameters are set as default (Soenen et al., 2021). |
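The split protocol (50% of anomalies to a balanced test set, 40% to the auxiliary pool, the remaining 10% to training) and EAP's single hyperparameter, the Beta prior set from the training anomaly proportion, can be sketched as below. This is a minimal illustration under the quoted description; the helper names are hypothetical, not the authors' released code.

```python
import numpy as np

def split_dataset(X_normal, X_anomaly, seed=0):
    """Hypothetical sketch of the quoted split protocol:
    (i) balanced test set with 50% of anomalies and as many normals,
    (ii) auxiliary pool with 40% of anomalies,
    (iii) training set with the remaining ~10% of anomalies
         plus all remaining normal examples."""
    rng = np.random.default_rng(seed)
    n_anom = len(X_anomaly)
    idx = rng.permutation(n_anom)
    n_test = int(0.5 * n_anom)   # (i) 50% of available anomalies
    n_aux = int(0.4 * n_anom)    # (ii) l = 40% of available anomalies
    test_anom = X_anomaly[idx[:n_test]]
    aux_anom = X_anomaly[idx[n_test:n_test + n_aux]]
    train_anom = X_anomaly[idx[n_test + n_aux:]]          # (iii) ~10%
    norm_idx = rng.permutation(len(X_normal))
    test_norm = X_normal[norm_idx[:n_test]]               # balance the test set
    train_norm = X_normal[norm_idx[n_test:]]              # all remaining normals
    return (train_norm, train_anom), aux_anom, (test_norm, test_anom)

def beta_prior(m, n):
    """EAP's prior as quoted: alpha0 = m/n (anomaly proportion in
    the training set), beta0 = 1 - m/n."""
    return m / n, 1.0 - m / n
```

For example, with 100 normals and 20 anomalies, the test set receives 10 anomalies and 10 normals, the auxiliary pool 8 anomalies, and training keeps 2 anomalies plus 90 normals, giving a prior of roughly (0.022, 0.978).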