Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Robust Random Cut Forest Based Anomaly Detection on Streams
Authors: Sudipto Guha, Nina Mishra, Gourav Roy, Okke Schrijvers
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we focus on datasets where anomalies are visual, verifiable and interpretable. We begin with a synthetic dataset that captures the classic diurnal rhythm of human activity. We then move to a real dataset reflecting taxi ridership in New York City. In both cases, we compare the performance of RRCF with IF. |
| Researcher Affiliation | Collaboration | Sudipto Guha EMAIL University of Pennsylvania, Philadelphia, PA 19104. Nina Mishra EMAIL Amazon, Palo Alto, CA 94303. Gourav Roy EMAIL Amazon, Bangalore, India 560055. Okke Schrijvers EMAIL Stanford University, Palo Alto, CA 94305. |
| Pseudocode | Yes | Algorithm 1 Algorithm Forget Point. ... Algorithm 2 Algorithm Insert Point. |
| Open Source Code | No | The paper does not contain any statements about releasing open-source code or provide links to a code repository for the methodology described. |
| Open Datasets | Yes | Next we conduct a streaming experiment using taxi ridership data from the NYC Taxi Commission (footnote 2). ... Footnote 2: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml |
| Dataset Splits | Yes | We learn a threshold for a good score on a training set and report the effectiveness on a held out test set. The training set contains all points before time t and the test set all points after time t. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, scikit-learn versions). |
| Experiment Setup | Yes | The experiments were run with a shingle of length four, and one hundred trees in the forest, where each tree is constructed with a uniform random reservoir sample of 256 points. ... In the experiments, there were 200 trees in the forest, each computed based on a random sample of 1K points. ... we set our time-decayed sampling parameter to the last two months of ridership. |
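
Since no code was released, the preprocessing quoted in the Dataset Splits and Experiment Setup rows can only be approximated. The following is a minimal sketch, not the authors' implementation, of three quoted steps: shingling a series with length four, drawing a uniform random reservoir sample of 256 points for one tree, and splitting points into training and test sets at a time t. The function names `shingle`, `reservoir_sample`, and `split_by_time` are hypothetical, the reservoir uses the standard Algorithm R, and assigning points at exactly time t to the test set is an assumption. The time-decayed sampling used for the taxi data is not covered.

```python
import random
from typing import Iterable, List, Optional, Sequence, Tuple


def shingle(series: Sequence[float], length: int = 4) -> List[Tuple[float, ...]]:
    """Turn a 1-D series into overlapping windows of `length` consecutive values."""
    return [tuple(series[i:i + length]) for i in range(len(series) - length + 1)]


def reservoir_sample(stream: Iterable, k: int = 256,
                     rng: Optional[random.Random] = None) -> list:
    """Uniform random sample of up to k items from a stream (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def split_by_time(points: Sequence, timestamps: Sequence, t) -> Tuple[list, list]:
    """Training set: points before time t. Test set: points at or after t (assumption)."""
    train = [p for p, ts in zip(points, timestamps) if ts < t]
    test = [p for p, ts in zip(points, timestamps) if ts >= t]
    return train, test
```

Under these assumptions, a forest of one hundred trees would call `reservoir_sample` once per tree on the shingled stream, each with an independent random generator.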