Efficient Anomaly Detection via Matrix Sketching
Authors: Vatsal Sharan, Parikshit Gopalan, Udi Wieder
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results. Our results have a parameter ℓ that controls the size and the accuracy of the sketch. While our theorems imply that ℓ can be chosen independent of d, they depend polynomially on k, the desired accuracy and other parameters, and are probably pessimistic. We validate both our algorithms on real-world data. In our experiments, we found that choosing ℓ to be a small multiple of k was sufficient to get good results. Our results show that one can get outcomes comparable to running full-blown SVD using sketches which are significantly smaller in memory footprint, faster to compute and easy to implement (literally a few lines of Python code). |
| Researcher Affiliation | Collaboration | Vatsal Sharan Stanford University vsharan@stanford.edu Parikshit Gopalan VMware Research pgopalan@vmware.com Udi Wieder VMware Research uwieder@vmware.com |
| Pseudocode | Yes | Algorithm 1: Algorithm to approximate anomaly scores using Frequent Directions; Algorithm 2: Algorithm to approximate anomaly scores using random projection |
| Open Source Code | No | The paper mentions an existing implementation: "Since the existing implementation of Frequent Directions [35] does not seem to handle sparse matrices." (footnote 4) and provides a link for it: "Edo Liberty and Mina Ghashami. https://github.com/edoliberty/frequent-directions." However, it does not explicitly state that the code *for their work* described in the paper is open-source or provide a link to it. |
| Open Datasets | Yes | Datasets: We ran experiments on three publicly available datasets: p53 mutants [32], Dorothea [33] and RCV1 [34], all of which are available from the UCI Machine Learning Repository, and are high dimensional (d > 5000). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train/validation/test sets. |
| Hardware Specification | Yes | Our timing experiments were run using Python/Jupyter notebook on a Linux VM with 8 cores and 32 GB of RAM |
| Software Dependencies | No | The paper mentions "Python/Jupyter notebook" and the `randomized_svd` function from scikit-learn, but does not provide specific version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | Input: Choice of k, sketch size ℓ for Frequent Directions [26]; Input: Choice of k, random projection matrix R ∈ ℝ^{d×ℓ}; We ran experiments with several values of ℓ in the range (2k, 20k) for each dataset |
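The paper's random-projection approach (Algorithm 2) sketches each d-dimensional point down to ℓ dimensions and computes anomaly scores from the sketch rather than from a full SVD of the data. The snippet below is a minimal illustrative sketch of this idea, not a reproduction of the paper's exact algorithm: it scores each point by its squared residual outside the top-k subspace of the randomly projected data. The function name, the Gaussian choice of R, and the projection-distance score are assumptions made for illustration; only k, ℓ, and R ∈ ℝ^{d×ℓ} follow the paper's notation.

```python
import numpy as np

def approx_anomaly_scores(X, k, ell, seed=0):
    """Illustrative anomaly scores via a random-projection sketch.

    X    : (n, d) data matrix, one point per row
    k    : target rank of the "normal" subspace
    ell  : sketch size (the paper uses ell a small multiple of k)
    Returns an (n,) array of projection-distance scores.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Random projection matrix R in R^{d x ell}, scaled to
    # roughly preserve norms (Johnson-Lindenstrauss style).
    R = rng.standard_normal((d, ell)) / np.sqrt(ell)
    Y = X @ R                                  # sketched data, (n, ell)
    # Top-k right singular directions of the sketch.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    Vk = Vt[:k]                                # (k, ell)
    # Score = squared distance from the top-k subspace of the sketch.
    residual = Y - Y @ Vk.T @ Vk
    return np.sum(residual ** 2, axis=1)
```

A quick sanity check on synthetic data: rows drawn from a rank-2 subspace of ℝ^100 get near-zero scores, while a row perturbed off that subspace stands out, even though the sketch uses only ℓ = 10 dimensions.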