SemanticMask: A Contrastive View Design for Anomaly Detection in Tabular Data

Authors: Shuting Tao, Tongtian Zhu, Hongwei Wang, Xiangming Meng

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiment results validate the superiority of SemanticMask over the state-of-the-art anomaly detection methods and existing augmentation techniques for tabular data.
Researcher Affiliation | Academia | 1. College of Computer Science and Technology, Zhejiang University; 2. The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
Pseudocode | No | The paper includes a block diagram (Figure 1) and mathematical formulations but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and appendix are available on GitHub at https://github.com/TST826/SemanticMask.
Open Datasets | Yes | We conduct experiments on nine datasets with column names sourced from the Outlier Detection Data Sets (ODDS) [Rayana, 2016], the KEEL datasets [Derrac et al., 2015] and the UCI datasets [Markelle et al., 2013].
Dataset Splits | Yes | We train our method on a randomly selected 50% subset of the normal data. The validation set, consisting of 25% of the normal data, is used to determine the threshold. The methods are then tested on the remaining normal data and all anomalous samples.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for conducting the experiments.
Software Dependencies | No | The paper mentions software components such as Sentence-BERT and the Adam optimizer but does not provide version numbers for any libraries, frameworks, or programming languages used in the implementation or experimentation.
Experiment Setup | Yes | For SemanticMask and its variants, λ is set to 0.5 and p_m is selected from the set {0.4, 0.5, 0.6}. For SemanticMask+description, ϵ is set to 0.1. The number of k-means clusters k is set proportionally to the feature dimension d: k = 2 for d < 18; k = 3 for 18 ≤ d < 100; and k = d/100 + 3 for complex datasets such as Arrhythmia [Rayana, 2016], where d ≥ 100. Features are partitioned into k clusters, forming two disjoint subsets of k/2 clusters each. The contrastive loss uses a constant temperature τ of 0.01. The threshold for identifying anomalies is the 85th quantile of the Mahalanobis distance on the validation set. The encoder is a multilayer perceptron with two hidden layers of 128 and 64 units and ReLU activations, trained with the Adam optimizer (learning rate 0.001, default values for other hyperparameters).
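The reported setup can be sketched in NumPy. This is a minimal illustration, not the authors' code: the encoder MLP is omitted, and the helper names `choose_k` and `mahalanobis_threshold` are hypothetical. It implements only the k-selection rule (k = 2 for d < 18, k = 3 for 18 ≤ d < 100, k = ⌊d/100⌋ + 3 otherwise) and the 85th-quantile Mahalanobis threshold computed on validation embeddings:

```python
import numpy as np

def choose_k(d):
    """Number of k-means clusters as a function of feature dimension d,
    following the rule reported in the paper."""
    if d < 18:
        return 2
    if d < 100:
        return 3
    return d // 100 + 3  # high-dimensional case, e.g. Arrhythmia (d >= 100)

def mahalanobis_threshold(z_val, quantile=85.0):
    """Anomaly threshold: the given quantile of Mahalanobis distances of
    validation embeddings z_val (shape: n_samples x embed_dim)."""
    mu = z_val.mean(axis=0)
    cov = np.cov(z_val, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    diff = z_val - mu
    # Squared Mahalanobis distance per sample: diff_i @ cov_inv @ diff_i
    dists = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    return np.percentile(dists, quantile)
```

At test time, a sample whose Mahalanobis distance to the validation statistics exceeds the returned threshold would be flagged as anomalous.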