Online Isolation Forest
Authors: Filippo Leveni, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, Giacomo Boracchi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental validation on real-world datasets demonstrated that ONLINE-IFOREST is on par with online alternatives and closely rivals state-of-the-art offline anomaly detection techniques that undergo periodic retraining. Notably, ONLINE-IFOREST consistently outperforms all competitors in terms of efficiency, making it a promising solution in applications where fast identification of anomalies is of primary importance such as cybersecurity, fraud and fault detection. |
| Researcher Affiliation | Academia | 1Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy 2Artificial Intelligence Institute, University of Waikato, Hamilton, New Zealand. |
| Pseudocode | Yes | Algorithm 1: ONLINE-IFOREST, Algorithm 2: ONLINE-ITREE learn point, Algorithm 3: ONLINE-ITREE forget point, Algorithm 4: ONLINE-ITREE point depth |
| Open Source Code | Yes | The code of our method is publicly available at https://github.com/ineveLoppiliF/Online-Isolation-Forest. |
| Open Datasets | Yes | We run our experiments on the eight largest datasets used in (Liu et al., 2008; 2012) (Http, Smtp (Yamanishi et al., 2004), Annthyroid, Forest Cover Type, Satellite, Shuttle (Asuncion & Newman, 2007), Mammography and Mulcross (Rocke & Woodruff, 1996)), two datasets from Kaggle competitions (Donors and Fraud (Pang et al., 2019)), and the shingled version of the NYC Taxicab dataset used in (Guha et al., 2016). |
| Dataset Splits | No | The paper mentions shuffling the datasets and processing them in a streaming fashion, but it does not specify explicit percentages or sample counts for training, validation, or test splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the "Autorank (Herbold, 2020) library" but does not specify version numbers for Autorank or any other core software libraries (e.g., Python, PyTorch/TensorFlow) that would be needed for reproduction. |
| Experiment Setup | Yes | For comparison purposes, we set the number of trees τ = 32 for all the algorithms, and considered the number of random cuts in LODA equivalent to the number of trees. We set the window size ω = 2048 for both oIFOR and asdIFOR, and used the default value ω = 250 for HST. We set the subsampling size used to build trees in asdIFOR to the default value ψ = 256, and the number of bins for each random projection in LODA to b = 100. The trees' maximum depth δ depends on the subsampling size ψ in asdIFOR, and on the window size ω and the number η of points required to split histogram bins in oIFOR, while it is fixed to the default value δ = 15 in HST. The parameter configuration for all the algorithms is illustrated in Table 3. |
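The hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch for anyone attempting a reproduction. This is a minimal illustration, not the authors' actual API: the dictionary keys (`num_trees`, `window_size`, etc.) and method labels are assumptions chosen for readability, and only values explicitly stated in the paper's setup are filled in (derived quantities such as δ for oIFOR/asdIFOR are left out rather than guessed).

```python
# Hedged sketch of the per-method hyperparameter settings quoted above.
# Key names are illustrative; the released code's argument names may differ.
configs = {
    # Online Isolation Forest: τ = 32 trees, window ω = 2048;
    # δ is derived from ω and η, so it is not hard-coded here.
    "oIFOR": {"num_trees": 32, "window_size": 2048},
    # Sliding-window iForest baseline: δ is derived from ψ = 256.
    "asdIFOR": {"num_trees": 32, "window_size": 2048, "subsample_size": 256},
    # Half-Space Trees: default window and fixed depth.
    "HST": {"num_trees": 32, "window_size": 250, "max_depth": 15},
    # LODA: random cuts matched to the tree count, b = 100 bins per projection.
    "LODA": {"num_projections": 32, "num_bins": 100},
}

# Sanity check: every method uses 32 trees / projections, as the setup states.
for name, cfg in configs.items():
    ensemble_size = cfg.get("num_trees", cfg.get("num_projections"))
    assert ensemble_size == 32, name
```

Keeping the configuration in one place like this makes it easy to diff against Table 3 of the paper when verifying a reimplementation.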