Online Isolation Forest

Authors: Filippo Leveni, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, Giacomo Boracchi

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental validation on real-world datasets demonstrated that ONLINE-IFOREST is on par with online alternatives and closely rivals state-of-the-art offline anomaly detection techniques that undergo periodic retraining. Notably, ONLINE-IFOREST consistently outperforms all competitors in efficiency, making it a promising solution for applications where fast identification of anomalies is of primary importance, such as cybersecurity, fraud detection, and fault detection.
Researcher Affiliation Academia 1Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy 2Artificial Intelligence Institute, University of Waikato, Hamilton, New Zealand.
Pseudocode Yes Algorithm 1: ONLINE-IFOREST, Algorithm 2: ONLINE-ITREE learn point, Algorithm 3: ONLINE-ITREE forget point, Algorithm 4: ONLINE-ITREE point depth
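The listed pseudocode is not reproduced here, but the point-depth routine (Algorithm 4) ultimately feeds the standard isolation-based anomaly score from Liu et al. (2008). The sketch below shows that classic scoring formula only; the function and variable names are illustrative and are not taken from the paper's algorithms.

```python
import math

EULER_MASCHERONI = 0.5772156649

def harmonic(n: int) -> float:
    # H(n) approximated as ln(n) + Euler-Mascheroni constant
    return math.log(n) + EULER_MASCHERONI

def c(n: int) -> float:
    # Average path length of an unsuccessful search in a BST of n points,
    # used to normalize tree depths (Liu et al., 2008)
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_depth: float, n: int) -> float:
    # Scores near 1 flag anomalies (points isolated at shallow depth);
    # scores near 0.5 indicate normal points
    return 2.0 ** (-mean_depth / c(n))
```

For example, a point isolated after an average depth of 1 in trees built from 256 points scores close to 1, while a point whose average depth equals c(256) scores exactly 0.5.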
Open Source Code Yes The code of our method is publicly available at https://github.com/ineveLoppiliF/Online-Isolation-Forest.
Open Datasets Yes We run our experiments on the eight largest datasets used in (Liu et al., 2008; 2012) (Http, Smtp (Yamanishi et al., 2004), Annthyroid, Forest Cover Type, Satellite, Shuttle (Asuncion & Newman, 2007), Mammography and Mulcross (Rocke & Woodruff, 1996)), two datasets from Kaggle competitions (Donors and Fraud (Pang et al., 2019)), and the shingled version of the NYC Taxicab dataset used in (Guha et al., 2016).
Dataset Splits No The paper mentions shuffling datasets and using them for testing, but it does not specify explicit percentages or sample counts for training, validation, or test splits.
Hardware Specification No The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions using the "Autorank (Herbold, 2020) library" but does not specify version numbers for Autorank or any other core software libraries (e.g., Python, PyTorch/TensorFlow) that would be needed for reproduction.
Experiment Setup Yes For comparison purposes, we set the number of trees τ = 32 for all the algorithms, and considered the number of random cuts in LODA equivalent to the number of trees. We set window size ω = 2048 for both oIFOR and asdIFOR, and used the default value ω = 250 for HST. We set the subsampling size used to build trees in asdIFOR to the default value ψ = 256, while the number of bins for each random projection in LODA to b = 100. The maximum tree depth δ depends on the subsampling size ψ in asdIFOR, and on the window size ω and the number η of points required to split histogram bins in oIFOR, while it is fixed to the default value δ = 15 in HST. The parameter configuration for all the algorithms is illustrated in Table 3.
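The reported configuration can be collected into a small sketch. The constant names, the value of η, and both depth formulas below are assumptions (the classic iForest convention δ = ⌈log2 ψ⌉ for asdIFOR, and ⌈log2(ω/η)⌉ for oIFOR); only the numeric settings τ, ω, ψ, b, and the HST depth come from the paper.

```python
import math

# Settings reported in the paper's experiment setup
NUM_TREES = 32        # tau, shared by all algorithms
WINDOW_ONLINE = 2048  # omega for oIFOR and asdIFOR
WINDOW_HST = 250      # default window for Half-Space Trees
SUBSAMPLE = 256       # psi, subsample size for asdIFOR tree building
LODA_BINS = 100       # b, histogram bins per LODA random projection
DEPTH_HST = 15        # fixed default maximum depth for HST

# Hypothetical value: the paper does not state eta here
SPLIT_MIN = 32        # eta, points required to split a histogram bin in oIFOR

# Assumed convention: classic iForest caps depth at ceil(log2(psi))
depth_asd = math.ceil(math.log2(SUBSAMPLE))

# Assumed convention: depth from window size and split threshold in oIFOR
depth_online = math.ceil(math.log2(WINDOW_ONLINE / SPLIT_MIN))
```

Under these assumptions, δ = 8 for asdIFOR and δ = 6 for oIFOR with the hypothetical η above; the actual values used are given in the paper's Table 3.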