Robust Random Cut Forest Based Anomaly Detection on Streams

Authors: Sudipto Guha, Nina Mishra, Gourav Roy, Okke Schrijvers

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiments, we focus on datasets where anomalies are visual, verifiable and interpretable. We begin with a synthetic dataset that captures the classic diurnal rhythm of human activity. We then move to a real dataset reflecting taxi ridership in New York City. In both cases, we compare the performance of RRCF with IF.
Researcher Affiliation | Collaboration | Sudipto Guha (SUDIPTO@CIS.UPENN.EDU), University of Pennsylvania, Philadelphia, PA 19104; Nina Mishra (NMISHRA@AMAZON.COM), Amazon, Palo Alto, CA 94303; Gourav Roy (GOURAVR@AMAZON.COM), Amazon, Bangalore, India 560055; Okke Schrijvers (OKKES@CS.STANFORD.EDU), Stanford University, Palo Alto, CA 94305.
Pseudocode | Yes | Algorithm 1 Algorithm Forget Point. ... Algorithm 2 Algorithm Insert Point.
Open Source Code | No | The paper does not contain any statements about releasing open-source code or provide links to a code repository for the methodology described.
Open Datasets | Yes | Next we conduct a streaming experiment using taxi ridership data from the NYC Taxi Commission (footnote 2). ... Footnote 2: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Dataset Splits | Yes | We learn a threshold for a good score on a training set and report the effectiveness on a held out test set. The training set contains all points before time t and the test set all points after time t.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, scikit-learn versions).
Experiment Setup | Yes | The experiments were run with a shingle of length four, and one hundred trees in the forest, where each tree is constructed with a uniform random reservoir sample of 256 points. ... In the experiments, there were 200 trees in the forest, each computed based on a random sample of 1K points. ... we set our time-decayed sampling parameter to the last two months of ridership.
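
The paper releases no code, so the following is only an illustrative sketch of the data preparation that the Dataset Splits and Experiment Setup rows describe: shingling a series into windows of length four, drawing a uniform reservoir sample of 256 points for each of one hundred trees, and splitting points at a time t. The function names (`shingle`, `reservoir_sample`, `split_at_time`) are hypothetical, the input series is synthetic, and assigning the point at index t to the test set is an assumption. The sketch does not build the random cut trees or compute anomaly scores.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def shingle(series: Sequence[float], length: int = 4) -> List[Tuple[float, ...]]:
    """Turn a 1-D series into overlapping windows of `length` consecutive values."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [tuple(series[i:i + length]) for i in range(len(series) - length + 1)]


def reservoir_sample(stream: Iterable, k: int, rng: random.Random) -> list:
    """Uniform random sample of at most k items from a stream (classic Algorithm R)."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


def split_at_time(points: Sequence, t: int) -> Tuple[list, list]:
    """Training set: points before index t. Test set: points from index t onward
    (placing the point at t in the test set is an assumption)."""
    return list(points[:t]), list(points[t:])


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a periodic signal; not the paper's data.
    series = [float(i % 24) for i in range(2000)]
    points = shingle(series, length=4)
    train, test = split_at_time(points, t=1500)
    # One independent reservoir sample of 256 points per tree, 100 trees.
    per_tree_samples = [reservoir_sample(train, 256, rng) for _ in range(100)]
    print(len(points), len(train), len(test), len(per_tree_samples))
```

Each entry of `per_tree_samples` would be the input to one tree; the second configuration quoted in the table (200 trees, samples of 1K points) corresponds to changing the two numeric arguments.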