Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Self-supervise, Refine, Repeat: Improving Unsupervised Anomaly Detection

Authors: Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O Arik, Chen-Yu Lee, Tomas Pfister

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments across various datasets from different domains, including semantic AD (CIFAR-10 (Krizhevsky & Hinton, 2009), Dog-vs-Cat (Elson et al., 2007)), real-world manufacturing visual AD use case (MVTec (Bergmann et al., 2019)), and real-world tabular AD benchmarks (e.g., detecting medical or network anomalies). We evaluate models at different anomaly ratios of unlabeled training data and show that SRR significantly boosts performance.
Researcher Affiliation	Industry	EMAIL Google Cloud AI
Pseudocode	Yes	Algorithm 1 SRR: Self-supervise, Refine, Repeat. Input: Train data D = {x_i}_{i=1}^{N}, ensemble count (K), threshold (γ). Output: Refined data (D̂), trained OCC (f), feature extractor (g).
Open Source Code	No	The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets	Yes	We conduct extensive experiments across various datasets from different domains, including semantic AD (CIFAR-10 (Krizhevsky & Hinton, 2009), Dog-vs-Cat (Elson et al., 2007)), real-world manufacturing visual AD use case (MVTec (Bergmann et al., 2019)), and real-world tabular AD benchmarks (e.g., detecting medical or network anomalies). Following (Zong et al., 2018; Bergman & Hoshen, 2019), we test the performance of SRR on a variety of real-world tabular AD datasets, including network (KDDCup) and medical (Thyroid, Arrhythmia) AD from the UCI repository (Asuncion & Newman, 2007).
Dataset Splits	Yes	To construct the data splits, we utilize 50% of normal samples for training. In addition, we hold out some anomaly samples (amounting to 10% of the normal samples) from the data. This allows to simulate unsupervised settings with an anomaly ratio of up to 10% of entire training set. Rest of the data is used for testing. For MVTec, since there are no anomalous data available for training, we borrow 10% of the anomalies from the test set and swap them with normal samples in the training set.
Hardware Specification	Yes	Each experimental run is performed on a single V100 GPU.
Software Dependencies	No	The paper mentions using specific models (e.g., ResNet-18 architecture) and optimizers (Momentum SGD) but does not provide specific version numbers for software libraries or programming languages.
Experiment Setup	Yes	The same model and hyperparameter configurations are used for SRR with K = 5 classifiers in the ensemble. We set γ as twice the anomaly ratio of training data. For 0% anomaly ratio, we set γ as 0.5. Finally, a Gaussian Density Estimator (GDE) on learned representations is used as the OCC. Optimizer Momentum SGD (momentum = 0.9) Learning rate 0.001 Batch size 64 M L2 weight regularization 0.00003 Random projection dimension 32
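
Illustrative sketch (not from the paper): the Dataset Splits and Experiment Setup rows above describe a split procedure and a rule for γ. The Python below is one possible reading of those two quotes, under stated assumptions. The function names `gamma_for` and `make_split` are hypothetical. It assumes that γ and the anomaly ratio are both expressed in percent, that the held-out anomaly pool is sized relative to the normal training samples, and that the injected anomaly count is the requested percentage of the normal training samples. The MVTec swap procedure is not covered.

```python
import random


def gamma_for(anomaly_ratio_pct):
    """Rule quoted in Experiment Setup: gamma is twice the anomaly ratio,
    and 0.5 when the ratio is 0. Percent units are an assumption."""
    if anomaly_ratio_pct == 0:
        return 0.5
    return 2 * anomaly_ratio_pct


def make_split(normal, anomalies, anomaly_ratio_pct, seed=0):
    """Hypothetical helper returning (train, test) lists of (sample, label).

    Label 0 is normal and 1 is anomaly. Training labels are kept only for
    inspection, since the setting described is unsupervised.
    """
    if not 0 <= anomaly_ratio_pct <= 10:
        raise ValueError("the quoted setup covers anomaly ratios from 0% to 10%")
    rng = random.Random(seed)
    normal, anomalies = list(normal), list(anomalies)
    rng.shuffle(normal)
    rng.shuffle(anomalies)

    # 50% of the normal samples are used for training.
    n_train_normal = len(normal) // 2
    train_normal, test_normal = normal[:n_train_normal], normal[n_train_normal:]

    # Hold out an anomaly pool amounting to 10% of the normal training
    # samples (assumption: "normal samples" means the training half).
    pool_size = min(n_train_normal // 10, len(anomalies))
    pool, test_anom = anomalies[:pool_size], anomalies[pool_size:]

    # Assumption: inject the requested percentage of the normal training
    # count, drawn from the held-out pool.
    n_inject = min(int(n_train_normal * anomaly_ratio_pct / 100), pool_size)
    train_anom = pool[:n_inject]

    train = [(x, 0) for x in train_normal] + [(x, 1) for x in train_anom]
    test = [(x, 0) for x in test_normal] + [(x, 1) for x in test_anom]
    rng.shuffle(train)
    return train, test
```

For example, `make_split(range(200), range(1000, 1050), 5)` would place 100 normal and 5 anomalous samples in training, and `gamma_for(5)` would give a threshold of 10 under the percent-unit assumption.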