PIDForest: Anomaly Detection via Partial Identification
Authors: Parikshit Gopalan, Vatsal Sharan, Udi Wieder
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments on real and synthetic data sets showing that our algorithm consistently outperforms or matches six popular anomaly detection algorithms. PIDForest is the top performing algorithm in 6 out of 12 benchmark real-world datasets, while no other algorithm is the best in more than 3. PIDForest is also resilient to noise and irrelevant attributes. These results are in Sections 4 and 5. |
| Researcher Affiliation | Collaboration | Parikshit Gopalan VMware Research pgopalan@vmware.com Vatsal Sharan Stanford University vsharan@stanford.edu Udi Wieder VMware Research uwieder@vmware.com |
| Pseudocode | Yes | PIDForest Fit. Params: number of trees t, samples m, max degree k, max depth h. Repeat t times: create root node v; let C(v) = [0, 1]^d and P(v) ⊆ T be a random subset of size m; call Split(v). Split(v): for each j ∈ [d], compute the best split into ≤ k intervals; pick the j that maximizes variance and split C along j into {C_i}, i = 1, …, k; for each i ∈ [k] create child v_i s.t. C(v_i) = C_i and P(v_i) = P(v) ∩ C_i; if depth(v_i) ≤ h and |P(v_i)| > 1 then Split(v_i), else set PIDScore(v_i) = vol(C(v_i))/|P(v_i)|. |
| Open Source Code | Yes | The code and data for all experiments is available online (Footnote 2: https://github.com/vatsalsharan/pidforest). |
| Open Datasets | Yes | The first set of datasets are classification datasets from the UCI [16] and open ML repository [17] (they are also available at [18]). [...] The next set of real-world datasets NYC taxicab, CPU utilization, Machine temperature (M.T.) and Ambient temperature (A.T.) are time series datasets from the Numenta anomaly detection benchmark which have been hand-labeled with anomalies rooted in real-world causes [21]. |
| Dataset Splits | No | The paper lists various datasets used for experiments but does not provide explicit details on how these datasets were split into training, validation, and testing sets, nor does it specify percentages or sample counts for such splits. |
| Hardware Specification | No | The paper mentions that its 'vanilla Python implementation on a laptop computer only takes about 5 minutes', but it does not provide specific hardware details such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper states that PIDForest is implemented in Python and that other algorithms use scikit-learn, PyOD, and an RRCF implementation, but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For PIDForest, we fix the hyperparameters of depth to 10, number of trees to 50, and the number of samples used to build each tree to 100. [...] For RRCF, we use 500 trees instead of the default 100 since it yielded significantly better performance. |
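The fit procedure in the Pseudocode row can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses binary median splits rather than the paper's optimization over up to k intervals per coordinate, and it scores a query point by the sparsity of the leaf cube it falls into in a single tree (the full algorithm aggregates scores across t trees).

```python
import numpy as np

class Node:
    """One node of a tree: a cube C(v) and the sample points P(v) inside it."""
    def __init__(self, lo, hi, points, depth):
        self.lo, self.hi = lo, hi      # axis-aligned cube C(v)
        self.points = points           # P(v): sample points inside the cube
        self.depth = depth
        self.split_dim = None
        self.split_val = None
        self.children = []
        self.score = None              # PIDScore(v) = vol(C(v)) / |P(v)| at leaves

def volume(lo, hi):
    return float(np.prod(hi - lo))

def split(node, max_depth):
    """Recursively split: pick the coordinate whose median split maximizes the
    variance of the children's sparsity (volume per point). Simplified: the
    paper searches over up to k intervals per coordinate, not just 2."""
    if node.depth >= max_depth or len(node.points) <= 1:
        node.score = volume(node.lo, node.hi) / max(len(node.points), 1)
        return
    best_var, best_j, best_t = -1.0, None, None
    for j in range(len(node.lo)):
        t = float(np.median(node.points[:, j]))
        if not (node.lo[j] < t < node.hi[j]):
            continue
        n_left = int((node.points[:, j] <= t).sum())
        n_right = len(node.points) - n_left
        if n_left == 0 or n_right == 0:
            continue
        hi_left = node.hi.copy(); hi_left[j] = t
        lo_right = node.lo.copy(); lo_right[j] = t
        spars = np.array([volume(node.lo, hi_left) / n_left,
                          volume(lo_right, node.hi) / n_right])
        if spars.var() > best_var:
            best_var, best_j, best_t = spars.var(), j, t
    if best_j is None:                 # no usable split: make this a leaf
        node.score = volume(node.lo, node.hi) / len(node.points)
        return
    node.split_dim, node.split_val = best_j, best_t
    mask = node.points[:, best_j] <= best_t
    for side_pts, side in ((node.points[mask], 0), (node.points[~mask], 1)):
        lo, hi = node.lo.copy(), node.hi.copy()
        if side == 0:
            hi[best_j] = best_t
        else:
            lo[best_j] = best_t
        child = Node(lo, hi, side_pts, node.depth + 1)
        node.children.append(child)
        split(child, max_depth)

def leaf_score(tree, x):
    """Sparsity of the leaf cube that x falls into; higher = more anomalous."""
    node = tree
    while node.children:
        node = node.children[0 if x[node.split_dim] <= node.split_val else 1]
    return node.score

# Demo: a dense cluster plus one isolated point in [0, 1]^2.
rng = np.random.default_rng(0)
data = np.vstack([rng.uniform(0.4, 0.6, size=(100, 2)), [[0.95, 0.95]]])
root = Node(np.zeros(2), np.ones(2), data, depth=0)
split(root, max_depth=10)
outlier_score = leaf_score(root, np.array([0.95, 0.95]))
inlier_score = leaf_score(root, np.array([0.5, 0.5]))
```

The isolated point ends up alone in a large leaf cube, so its volume-per-point score exceeds that of points inside the dense cluster, matching the intuition behind PIDScore.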