Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
An Efficient One-Class SVM for Novelty Detection in IoT
Authors: Kun Yang, Samory Kpotufe, Nick Feamster
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work designs, implements, and evaluates an efficient OCSVM for such practical settings. We extend Nyström and (Gaussian) Sketching approaches to OCSVM, combining these methods with clustering and Gaussian mixture models to achieve 15-30x speedup in prediction time and 30-40x reduction in memory requirements without sacrificing detection accuracy. ... We implement the above described approach based on mapping the normal training data ... We evaluate OC-Nyström and OC-KJL, both with and without automatic GMM parameter selection, on multiple IoT datasets ... We observe typical detection time speedups (w.r.t. the baseline OCSVM) between 14 to 20 times faster using either OC-Nyström or OC-KJL, and 40+ times for some datasets. |
| Researcher Affiliation | Academia | Kun Yang EMAIL Columbia University Samory Kpotufe EMAIL Columbia University Nick Feamster EMAIL University of Chicago |
| Pseudocode | Yes | Meta Procedures. The resulting OC-Nyström and OC-KJL approaches are summarized below. Given a Gaussian kernel K with bandwidth h, embedding choices m, d, and training size n: Training: Given normal data $\{X_i\}_{i=1}^n \subset \mathbb{R}^D$, do: Embed each $X_i$ as $\phi(X_i) \in \mathbb{R}^d$ via Nyström (Equation 1) or KJL (Equation 2); the parameter k is passed in or is chosen via Quickshift++ (see paragraph below) on the embedded data $\{\phi(X_i)\}_{i=1}^n \subset \mathbb{R}^d$; estimate a GMM f with k components on $\{\phi(X_i)\}_{i=1}^n$; return the GMM f along with the projection $\phi$ (i.e., matrix P and subsample $S_m$). Detection: Given new $x \in \mathbb{R}^D$ and model $(\phi, f)$, do: Embed x as $\phi(x) \in \mathbb{R}^d$; flag x as a novelty iff $f(\phi(x)) \le$ threshold t. |
| Open Source Code | Yes | All the source code is available at KJL: An efficient one-class SVM for novelty detection, source code. https://github.com/kun0906/kjl |
| Open Datasets | Yes | Table 1 describes the datasets we used in the main paper, along with the associated types of novelty being detected. There are seven datasets in total, of which three are IoT datasets collected from three IoT devices deployed at the University of Chicago, and the remaining four are public datasets (i.e., CTU IoT, UNB IDS, MAWI, and MACCDC). CTU IoT (García, 2019), UNB IDS (Sharafaldin et al., 2018), MAWI (Naga & Kaizaki, 2020), MACCDC (O'Brien et al., 2012) |
| Dataset Splits | Yes | We randomly split the obtained data into training, validation, and test sets of sizes detailed in Table B.1 in the Appendix. (ii) Repeat 5 times for accurate AUC: draw a subsample of size n = 10K from normal data to form the training data, except for MAWI (n = 5.7K). If tuning: draw a validation sample (1/4 test set size). Choose parameters h, k as described in Section 5.1. Train with the chosen h, k and save the model to disk. Load and test the model on the test data; repeat 100 times for accurate timing on each machine (retain aggregate time). Table B.1: Dataset sizes (# of data points) and dimensions. UNB: 10,000 train / 462 val. / 1,854 test; CTU: 10,000 / 1,250 / 5,000; MAWI: 5,720 / 1,040 / 4,160; MACCDC: 10,000 / 1,250 / 5,000; SFRIG: 10,000 / 1,250 / 5,000; AECHO: 10,000 / 280 / 1,120; DWSHR: 10,000 / 508 / 2,032. |
| Hardware Specification | Yes | We perform our experiments on two computing platforms: (1) a well-provisioned server, for the use case where all training and detection might occur offline; and (2) resource-constrained devices, specifically a Raspberry Pi and an Nvidia Jetson Nano, corresponding to the use case where detection is to be real-time, local to the IoT device. Table A.1 in the Appendix provides details. Table A.1: We train on the server and test on all 3 machines. Large Server: 64-bit, running Debian GNU/Linux 9 (stretch) with an Intel(R) Xeon(R) processor (32 CPU cores, 1200-3400 MHz each), 100 GB memory, and 2 TB disk. Raspberry Pi: 32-bit, running Raspbian GNU/Linux 10 (buster) with a Cortex-A72 processor (4 CPU cores, 600-1500 MHz each), 8 GB memory, and 27 GB disk. Nvidia Nano: 64-bit, running Ubuntu 18.04.5 LTS (Bionic Beaver) with a Cortex-A57 processor (4 CPU cores, 102-1479 MHz each), 4 GB memory, and 30 GB disk. |
| Software Dependencies | Yes | All detection procedures are implemented in Python, calling on the scikit-learn package for existing procedures such as OCSVM and GMM. ... Numpy to ensure fair, apples-to-apples execution time comparison with OC-Nyström and OC-KJL, which are implemented in Numpy ... Table A.1: ... Large Server ... Programming language: Python 3.7.9. Numpy 1.19.2 ... Raspberry Pi ... Programming language: Python 3.7.3 with enable-optimizations option. Numpy 1.18.2 ... Nvidia Nano ... Programming language: Python 3.7.3 with enable-optimizations option. Numpy 1.18.2 ... |
| Experiment Setup | Yes | Kernel Bandwidth h. For all methods, i.e., OCSVM, OC-Nyström, and OC-KJL, we use a Gaussian kernel of the form $K(x, x') = \exp(-\|x - x'\|^2 / h^2)$, where the bandwidth h is to be picked as a quantile of the $\binom{n}{2}$ pairwise distances between the n training data points. In all our results, we consider the 10 quantiles $\{0.1, 0.2, \ldots, 0.9\} \cup \{0.95\}$ of increasing interpoint distances. Number of GMM components k. ... We consider choices in the range {1, 4, 6, 8, 10, 12, 14, 16, 18, 20}. Projection Parameters. As discussed in Section 4, the subsample size m and projection dimension d are fixed to m = 100 and d = 5, choices which remarkably preserve detection performance across datasets and types of novelty, despite the considerable amount of compression they entail. Quickshift++ Parameters. We use the implementation of (Jiang et al., 2018; Jiang et al.), which requires an internal parameter β set to 0.9 (this performs density smoothing) and the number of neighbors set to $n^{2/3}$ (to build a dense neighborhood graph whose connectivity encodes high-density regions), two choices that work well across device datasets and types of novelty. Gaussian Mixture Models Parameters. ... When using Quickshift++, we initialize the GMM with the clusters returned, i.e., the local means and covariances of these clusters, and train till convergence. |
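The training/detection meta-procedure quoted in the Pseudocode row can be sketched in a few lines of Python. This is a minimal, illustrative reconstruction, not the authors' code: it assumes a KJL-style embedding (Gaussian kernel features against m landmark points, followed by a random Gaussian projection to dimension d), synthetic stand-in data, and a hypothetical threshold rule (5th percentile of training log-densities); the paper's actual embedding matrices, bandwidth tuning, and Quickshift++ initialization are omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-ins: n normal training points in R^D; m, d as in the paper's defaults.
n, D, m, d, k = 500, 20, 100, 5, 3
X = rng.normal(size=(n, D))

# Bandwidth h: a quantile of pairwise distances on a subsample (cf. Section 5.1).
sub = X[rng.choice(n, 60, replace=False)]
h = np.quantile(np.linalg.norm(sub[:, None] - sub[None, :], axis=-1), 0.5)

# KJL-style embedding: kernel features against m landmarks, projected to R^d.
S = X[rng.choice(n, m, replace=False)]      # landmark subsample S_m
P = rng.normal(size=(m, d)) / np.sqrt(d)    # random Gaussian projection P

def embed(A):
    """phi(A): Gaussian kernel values against landmarks S, projected by P."""
    d2 = ((A[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h**2) @ P

# Training: embed the normal data, fit a k-component GMM on the embedding.
Z = embed(X)
gmm = GaussianMixture(n_components=k, random_state=0).fit(Z)

# Detection threshold t (assumed here): 5th percentile of training log-densities.
t = np.quantile(gmm.score_samples(Z), 0.05)

def is_novel(x):
    """Flag x as a novelty iff the GMM log-density of phi(x) falls below t."""
    return gmm.score_samples(embed(x[None]))[0] < t
```

By construction of the threshold, roughly 5% of the training points are flagged; the speedup the paper reports comes from scoring a d=5 GMM instead of evaluating an OCSVM over its full set of support vectors.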
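The bandwidth grid from the Experiment Setup row (quantiles {0.1, ..., 0.9} plus 0.95 of the interpoint distances) is easy to reproduce with NumPy alone. A small sketch on synthetic data; the array shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))              # synthetic training data

# All n(n-1)/2 pairwise distances between the n training points.
i, j = np.triu_indices(len(X), k=1)
dists = np.linalg.norm(X[i] - X[j], axis=1)

# Candidate bandwidths: quantiles {0.1, ..., 0.9} plus {0.95} of those distances.
qs = np.append(np.round(np.arange(0.1, 1.0, 0.1), 2), 0.95)
bandwidths = np.quantile(dists, qs)
```

Each of the 10 resulting bandwidths would then be tried (together with each candidate k) during validation, as described in Section 5.1.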