Dissect Black Box: Interpreting for Rule-Based Explanations in Unsupervised Anomaly Detection

Authors: Yu Zhang, Ruoyu Li, Nengwu Wu, Qing Li, Xinhan Lin, Yang Hu, Tao Li, Yong Jiang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experimental phase, we tested the capabilities of our model, focusing on its ability to autonomously extract rules from black-box anomaly detection models and on its accuracy in detecting anomalies. ... The results demonstrate that our method not only extracts interpretable and precise rules from black-box models but also achieves strong fidelity, robustness, and detection rates (true positive and true negative rates). (An illustrative computation of these metrics appears after the table.)
Researcher Affiliation | Collaboration | Shanghai Artificial Intelligence Laboratory, China; Peng Cheng Laboratory, China; College of Computer Science and Software Engineering, Shenzhen University, China; Tsinghua University, China; Tsinghua Shenzhen International Graduate School, China; Hunan University of Science and Technology, China
Pseudocode | Yes | Algorithm 1: Implementation of SCD-Tree and Gaussian Process for Boundary Delineation. (A generic Gaussian-process boundary sketch, not the paper's algorithm, appears after the table.)
Open Source Code | Yes | Details on model hyperparameters, dataset splits, and the random seeds used to ensure reproducibility will be available through an anonymous repository, providing full access to the scripts and setup used in our research. ... Instructions for accessing the data and code will be made available through an anonymous repository, detailed in the "Implementation of Experiment" subsection (see 6.1). We also submitted our data and code in the Supplementary Material zip package; since some of the datasets are quite large, we included only a subset of the important data and added a "readme.txt" file in the dataset folder that gives the public links to the datasets and a detailed description of how to use them.
Open Datasets | Yes | We employ four distinct datasets to evaluate our method across various security-related domains. These datasets include Malicious and Benign Webpages [45] for web security, KDDCup [46] for classic network intrusion scenarios [47], CIC-IDS [48] for modern network attacks, and TON-IoT [49], which integrates IoT and traditional network data.
Dataset Splits | Yes | These tabular-format datasets are systematically partitioned into training, validation, and testing segments following an 8:1:1 ratio split. (An illustrative 8:1:1 split sketch appears after the table.)
Hardware Specification | Yes | The computational infrastructure is centered around a high-capacity server featuring an Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz with 527GB of RAM, supporting the intensive computational demands of our experiments. Additionally, an NVIDIA GeForce RTX 3090 Super with 24GB VRAM is utilized specifically for the computationally intensive tasks of training our deep learning models.
Software Dependencies | Yes | Our implementation utilizes PyTorch (version 2.1.0) to facilitate the development and training of deep learning models, specifically Autoencoders (AE) and Variational Autoencoders (VAE), which are crucial for our anomaly detection tasks. Complementing this, scikit-learn (version 1.1.3) is employed for essential preprocessing, feature engineering, and the evaluation of models. The entire system is orchestrated using Python (version 3.8.18), chosen for its extensive libraries that streamline data manipulation and experimental workflows. (A minimal autoencoder sketch in this stack appears after the table.)
Experiment Setup | Yes | Details on model hyperparameters, dataset splits, and the random seeds used to ensure reproducibility will be available through an anonymous repository, providing full access to the scripts and setup used in our research. ... We detail the hyperparameter tuning process for the SCD-Tree and GBD algorithms, focusing on the influence of feature count and sample size on performance metrics.
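
For concreteness, the fidelity and detection-rate metrics named in the "Research Type" row can be computed as below. This is a minimal sketch that assumes the black-box detector's decisions serve as the reference for fidelity and that ground-truth labels give the true positive and true negative rates; the arrays here are synthetic placeholders, not the paper's data.

```python
# Minimal sketch: fidelity of extracted rules w.r.t. the black-box detector,
# plus true positive / true negative rates against ground truth.
# All arrays below are synthetic placeholders, not the paper's data.
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])      # ground-truth labels (1 = anomaly)
y_blackbox = np.array([0, 0, 1, 1, 0, 1, 1, 1])  # black-box detector's decisions
y_rules = np.array([0, 0, 1, 1, 0, 0, 1, 1])     # decisions of the extracted rule set

# Fidelity: how often the rules agree with the black-box model they explain.
fidelity = (y_rules == y_blackbox).mean()

# Detection rates of the rules against ground truth.
tpr = ((y_rules == 1) & (y_true == 1)).sum() / (y_true == 1).sum()  # true positive rate
tnr = ((y_rules == 0) & (y_true == 0)).sum() / (y_true == 0).sum()  # true negative rate

print(f"fidelity={fidelity:.2f}, TPR={tpr:.2f}, TNR={tnr:.2f}")
```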
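The paper's Algorithm 1 (SCD-Tree with a Gaussian process for boundary delineation) is not reproduced here. As a loosely related illustration only, the sketch below fits a scikit-learn GaussianProcessClassifier as a surrogate to the decisions of a stand-in black-box detector (an IsolationForest); the detector, kernel, and synthetic data are all assumptions and do not reflect the authors' SCD-Tree/GBD procedure.

```python
# Generic sketch: approximate a black-box anomaly detector's decision boundary
# with a Gaussian process surrogate. This is NOT the paper's SCD-Tree/GBD
# algorithm; IsolationForest, the RBF kernel, and the synthetic data are
# placeholder assumptions for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                      # synthetic 2-D feature vectors

blackbox = IsolationForest(random_state=0).fit(X)  # stand-in black-box detector
y_bb = (blackbox.predict(X) == -1).astype(int)     # 1 = flagged as anomaly

# Fit a GP classifier to the detector's own labels; its predictive probability
# gives a smooth, queryable approximation of the detector's decision boundary.
gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
gp.fit(X, y_bb)

agreement = (gp.predict(X) == y_bb).mean()
print(f"surrogate/black-box agreement: {agreement:.2f}")
```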
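The 8:1:1 split reported under "Dataset Splits" can be reproduced along these lines; the file path, random seed, and use of train_test_split are illustrative assumptions rather than the authors' exact preprocessing.

```python
# Illustrative 8:1:1 train/validation/test split for a tabular dataset.
# "dataset.csv" and the random seed are placeholders, not the paper's setup.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")

# 80% for training, then split the remaining 20% evenly into validation and test.
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)

print(len(train_df), len(val_df), len(test_df))  # roughly 8:1:1
```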
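To make the "Software Dependencies" row concrete, below is a minimal PyTorch autoencoder of the kind the paper trains for anomaly detection; the layer sizes, feature count, and thresholding rule are assumptions for illustration, not the authors' configuration.

```python
# Minimal PyTorch autoencoder sketch for tabular anomaly detection.
# Layer widths, latent size, and the 3-sigma threshold are illustrative
# assumptions, not the paper's hyperparameters.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder(n_features=20)
x = torch.randn(128, 20)                          # stand-in batch of scaled tabular features

recon_error = ((model(x) - x) ** 2).mean(dim=1)   # per-sample reconstruction error
threshold = recon_error.mean() + 3 * recon_error.std()
anomalies = recon_error > threshold               # samples the AE reconstructs poorly
print(f"flagged {anomalies.sum().item()} of {len(x)} samples")
```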