Robust Random Cut Forest Based Anomaly Detection on Streams
Authors: Sudipto Guha, Nina Mishra, Gourav Roy, Okke Schrijvers
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we focus on datasets where anomalies are visual, verifiable and interpretable. We begin with a synthetic dataset that captures the classic diurnal rhythm of human activity. We then move to a real dataset reflecting taxi ridership in New York City. In both cases, we compare the performance of RRCF with IF. |
| Researcher Affiliation | Collaboration | Sudipto Guha (SUDIPTO@CIS.UPENN.EDU), University of Pennsylvania, Philadelphia, PA 19104; Nina Mishra (NMISHRA@AMAZON.COM), Amazon, Palo Alto, CA 94303; Gourav Roy (GOURAVR@AMAZON.COM), Amazon, Bangalore, India 560055; Okke Schrijvers (OKKES@CS.STANFORD.EDU), Stanford University, Palo Alto, CA 94305. |
| Pseudocode | Yes | Algorithm 1: Forget Point. ... Algorithm 2: Insert Point. |
| Open Source Code | No | The paper does not contain any statements about releasing open-source code or provide links to a code repository for the methodology described. |
| Open Datasets | Yes | Next we conduct a streaming experiment using taxi ridership data from the NYC Taxi Commission (footnote 2). ... Footnote 2: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml |
| Dataset Splits | Yes | We learn a threshold for a good score on a training set and report the effectiveness on a held out test set. The training set contains all points before time t and the test set all points after time t. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, scikit-learn versions). |
| Experiment Setup | Yes | The experiments were run with a shingle of length four, and one hundred trees in the forest, where each tree is constructed with a uniform random reservoir sample of 256 points. ... In the experiments, there were 200 trees in the forest, each computed based on a random sample of 1K points. ... we set our time-decayed sampling parameter to the last two months of ridership. |
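
The preprocessing steps quoted in the Dataset Splits and Experiment Setup rows (shingling, uniform reservoir sampling, and a time-based train/test split) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the reservoir sampler uses the classic Algorithm R (the paper does not say which sampler was used), and the point at time t is assumed to belong to the test set. It does not build the random cut trees themselves.

```python
import random
from typing import List, Sequence, Tuple


def shingle(series: Sequence[float], length: int = 4) -> List[Tuple[float, ...]]:
    """Turn a 1-D stream into overlapping windows of `length` consecutive values."""
    return [tuple(series[i:i + length]) for i in range(len(series) - length + 1)]


def reservoir_sample(stream, k: int = 256, seed: int = 0) -> list:
    """Uniform random sample of up to k items from a stream (Algorithm R; an assumption)."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def temporal_split(points: list, t: int) -> Tuple[list, list]:
    """Training set: points before index t. Test set: points from t onward (assumption)."""
    return points[:t], points[t:]


# Hypothetical usage on a toy periodic stream.
series = [float(i % 24) for i in range(1000)]
points = shingle(series, length=4)
train, test = temporal_split(points, t=700)
tree_sample = reservoir_sample(train, k=256)
```

In a setup like the one quoted above, each tree in the forest would draw its own sample (for example, by varying `seed`), and a score threshold would be chosen on `train` before being evaluated on `test`.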