PIDForest: Anomaly Detection via Partial Identification
Authors: Parikshit Gopalan, Vatsal Sharan, Udi Wieder
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments on real and synthetic data sets showing that our algorithm consistently outperforms or matches six popular anomaly detection algorithms. PIDForest is the top performing algorithm in 6 out of 12 benchmark real-world datasets, while no other algorithm is the best in more than 3. PIDForest is also resilient to noise and irrelevant attributes. These results are in Sections 4 and 5. |
| Researcher Affiliation | Collaboration | Parikshit Gopalan VMware Research pgopalan@vmware.com Vatsal Sharan Stanford University vsharan@stanford.edu Udi Wieder VMware Research uwieder@vmware.com |
| Pseudocode | Yes | PIDForest Fit. Params: number of trees t, samples m, max degree k, max depth h. Repeat t times: create root node v; let C(v) = [0, 1]^d and P(v) ⊆ T be a random subset of size m; call Split(v). Split(v): for each j ∈ [d], compute the best split into ≤ k intervals; pick the j that maximizes variance and split C along j into {C_i}, i = 1, …, k; for each i ∈ [k] create child v_i s.t. C(v_i) = C_i and P(v_i) = P(v) ∩ C_i; if depth(v_i) ≤ h and |P(v_i)| > 1 then Split(v_i), else set PIDScore(v_i) = vol(C(v_i))/|P(v_i)|. |
| Open Source Code | Yes | The code and data for all experiments is available online (Footnote 2: https://github.com/vatsalsharan/pidforest). |
| Open Datasets | Yes | The first set of datasets are classification datasets from the UCI [16] and open ML repository [17] (they are also available at [18]). [...] The next set of real-world datasets NYC taxicab, CPU utilization, Machine temperature (M.T.) and Ambient temperature (A.T.) are time series datasets from the Numenta anomaly detection benchmark which have been hand-labeled with anomalies rooted in real-world causes [21]. |
| Dataset Splits | No | The paper lists various datasets used for experiments but does not provide explicit details on how these datasets were split into training, validation, and testing sets, nor does it specify percentages or sample counts for such splits. |
| Hardware Specification | No | The paper mentions that its 'vanilla Python implementation on a laptop computer only takes about 5 minutes', but it does not provide specific hardware details such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper states that PIDForest is implemented in Python and that other algorithms use scikit-learn, PyOD, and an RRCF implementation, but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For PIDForest, we fix the hyperparameters of depth to 10, number of trees to 50, and the number of samples used to build each tree to 100. [...] For RRCF, we use 500 trees instead of the default 100 since it yielded significantly better performance. |
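The fit procedure in the Pseudocode row can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses binary median splits rather than the paper's optimization over up to k intervals per coordinate, and it scores a query point by the sparsity of the leaf cube it falls into in a single tree (the full algorithm aggregates scores across t trees).

```python
import numpy as np

class Node:
    """One node of a tree: a cube C(v) and the sample points P(v) inside it."""
    def __init__(self, lo, hi, points, depth):
        self.lo, self.hi = lo, hi      # axis-aligned cube C(v)
        self.points = points           # P(v): sample points inside the cube
        self.depth = depth
        self.split_dim = None
        self.split_val = None
        self.children = []
        self.score = None              # PIDScore(v) = vol(C(v)) / |P(v)| at leaves

def volume(lo, hi):
    return float(np.prod(hi - lo))

def split(node, max_depth):
    """Recursively split: pick the coordinate whose median split maximizes the
    variance of the children's sparsity (volume per point). Simplified: the
    paper searches over up to k intervals per coordinate, not just 2."""
    if node.depth >= max_depth or len(node.points) <= 1:
        node.score = volume(node.lo, node.hi) / max(len(node.points), 1)
        return
    best_var, best_j, best_t = -1.0, None, None
    for j in range(len(node.lo)):
        t = float(np.median(node.points[:, j]))
        if not (node.lo[j] < t < node.hi[j]):
            continue
        n_left = int((node.points[:, j] <= t).sum())
        n_right = len(node.points) - n_left
        if n_left == 0 or n_right == 0:
            continue
        hi_left = node.hi.copy(); hi_left[j] = t
        lo_right = node.lo.copy(); lo_right[j] = t
        spars = np.array([volume(node.lo, hi_left) / n_left,
                          volume(lo_right, node.hi) / n_right])
        if spars.var() > best_var:
            best_var, best_j, best_t = spars.var(), j, t
    if best_j is None:                 # no usable split: make this a leaf
        node.score = volume(node.lo, node.hi) / len(node.points)
        return
    node.split_dim, node.split_val = best_j, best_t
    mask = node.points[:, best_j] <= best_t
    for side_pts, side in ((node.points[mask], 0), (node.points[~mask], 1)):
        lo, hi = node.lo.copy(), node.hi.copy()
        if side == 0:
            hi[best_j] = best_t
        else:
            lo[best_j] = best_t
        child = Node(lo, hi, side_pts, node.depth + 1)
        node.children.append(child)
        split(child, max_depth)

def leaf_score(tree, x):
    """Sparsity of the leaf cube that x falls into; higher = more anomalous."""
    node = tree
    while node.children:
        node = node.children[0 if x[node.split_dim] <= node.split_val else 1]
    return node.score

# Demo: a dense cluster plus one isolated point in [0, 1]^2.
rng = np.random.default_rng(0)
data = np.vstack([rng.uniform(0.4, 0.6, size=(100, 2)), [[0.95, 0.95]]])
root = Node(np.zeros(2), np.ones(2), data, depth=0)
split(root, max_depth=10)
outlier_score = leaf_score(root, np.array([0.95, 0.95]))
inlier_score = leaf_score(root, np.array([0.5, 0.5]))
```

The isolated point ends up alone in a large leaf cube, so its volume-per-point score exceeds that of points inside the dense cluster, matching the intuition behind PIDScore.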