Random Forest Density Estimation

Authors: Hongwei Wen, Hanyuan Hang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The contributions of this paper come from the theoretical and experimental perspectives: (i) We propose a tree-based density estimation algorithm called random forest density estimation (RFDE), which not only alleviates the problem of boundary discontinuity long plaguing partition-based methods, but also enjoys high computational efficiency. (ii) From a learning theory point of view, we prove fast convergence rates of RFDE under the assumption that the underlying density functions lie in the Hölder space C^{0,α}. (iii) In experiments, we validate the theoretical results and evaluate our RFDE through comparisons on both synthetic and real data. Moreover, we evaluate our RFDE on the problem of anomaly detection as a possible application.
Researcher Affiliation | Academia | Department of Applied Mathematics, University of Twente, Enschede, The Netherlands. Correspondence to: Hongwei Wen <h.wen@utwente.nl>.
Pseudocode | Yes | Algorithm 1 Random Tree Partition
Input: depth of the random tree p; initial partition π_0 = {A_0^1 := B_r}.
for i = 1 to p do
    for j = 1 to 2^{i-1} do
        For the rectangular cell A_{i-1}^j, randomly choose one dimension coordinate Z_{i,j} whose probability distribution is given by (1);
        Divide the cell A_{i-1}^j into two subregions, that is, A_{i-1}^j = A_i^{2j-1} ∪ A_i^{2j}, along the midpoint of the dimension Z_{i,j};
    end for
    Get π_i = {A_i^j, 1 ≤ j ≤ 2^i}.
end for
Output: random tree partition π_p.
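Algorithm 1 admits a minimal Python sketch. The function name is illustrative; the split coordinate is drawn uniformly at random here for simplicity, whereas the paper's Eq. (1) defines the actual splitting distribution, and the initial cell B_r is taken as an axis-aligned hyperrectangle:

```python
import random

def random_tree_partition(bounds, depth, rng=None):
    """Build a random tree partition pi_p of the hyperrectangle `bounds`.

    bounds: list of (low, high) pairs, one per dimension (the cell B_r);
    depth:  p, the number of splitting rounds.
    Returns the final partition as a list of 2**depth rectangular cells.
    Assumption: the split coordinate Z_{i,j} is drawn uniformly at random;
    the paper's Eq. (1) specifies the actual splitting distribution.
    """
    rng = rng or random.Random(0)
    partition = [list(bounds)]              # pi_0 = {A_0^1 := B_r}
    for _ in range(depth):                  # rounds i = 1..p
        refined = []
        for cell in partition:              # cells A_{i-1}^j
            z = rng.randrange(len(cell))    # choose dimension Z_{i,j}
            low, high = cell[z]
            mid = (low + high) / 2.0        # split at the midpoint
            left, right = list(cell), list(cell)
            left[z], right[z] = (low, mid), (mid, high)  # A_i^{2j-1}, A_i^{2j}
            refined.extend([left, right])
        partition = refined                 # pi_i
    return partition                        # pi_p

# Example: partition the unit square to depth p = 3 -> 2**3 cells.
cells = random_tree_partition([(0.0, 1.0), (0.0, 1.0)], depth=3)
print(len(cells))  # 8
```

Because every split is at a midpoint, the cells at depth p tile the original rectangle exactly, which is easy to check by summing their volumes.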
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | We conduct numerical comparisons on real datasets from the UCI repository (Dua & Graff, 2017). We put the detailed description of datasets in Section C.2 of the appendix. [...] In the following experiments, we generate 2,000 and 10,000 i.i.d. samples as training and testing data respectively from each type of synthetic dataset, each with dimension d ∈ {2, 5, 7}.
Dataset Splits | Yes | The two hyper-parameters p are chosen from {1, 2, ..., 15}, respectively, by 3-fold cross-validation. We repeat this procedure 10 times to evaluate the standard deviation for ANLL.
Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, processors, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using scikit-learn for the anomaly detection algorithm implementations but does not specify its version or any other software versions.
Experiment Setup | Yes | The number of iterations T is set to be 100 and the two hyper-parameters p are chosen from {1, 2, ..., 15}, respectively, by 3-fold cross-validation. [...] We pick the optimal p from 1 to 15. For each T we repeat this procedure 10 times. [...] We select T = 500 to make the density estimator converge with sufficiently many base learners.
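The quoted protocol, picking the tree depth p from a grid by k-fold cross-validation on the average negative log-likelihood (ANLL), can be sketched for a single-tree histogram density on [0, 1]^d (RFDE itself averages T such trees). All function names, the uniform choice of split coordinate, and the Laplace smoothing are illustrative assumptions, not the paper's exact estimator:

```python
import math
import random

def random_cells(d, depth, rng):
    """Midpoint-split [0, 1]^d for `depth` rounds along random coordinates."""
    cells = [[(0.0, 1.0)] * d]
    for _ in range(depth):
        refined = []
        for cell in cells:
            z = rng.randrange(d)
            lo, hi = cell[z]
            left, right = list(cell), list(cell)
            left[z], right[z] = (lo, (lo + hi) / 2), ((lo + hi) / 2, hi)
            refined.extend([left, right])
        cells = refined
    return cells

def anll(train, test, depth, rng):
    """ANLL of a tree-histogram density fit on `train`, scored on `test`.
    Laplace smoothing (an illustration choice, not part of the paper's
    estimator) keeps empty cells from producing log(0)."""
    cells = random_cells(len(train[0]), depth, rng)
    def locate(x):  # linear scan is fine for small depths
        return next(k for k, c in enumerate(cells)
                    if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, c)))
    counts = [0] * len(cells)
    for x in train:
        counts[locate(x)] += 1
    vols = [math.prod(hi - lo for lo, hi in c) for c in cells]
    n, m = len(train), len(cells)
    dens = [(counts[k] + 1) / ((n + m) * vols[k]) for k in range(m)]
    return -sum(math.log(dens[locate(x)]) for x in test) / len(test)

def select_depth(data, depths, folds=3, seed=0):
    """Pick the depth p with the lowest `folds`-fold cross-validated ANLL."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    chunks = [data[i::folds] for i in range(folds)]
    scores = {}
    for p in depths:
        cv = []
        for i in range(folds):
            test = chunks[i]
            train = [x for j, c in enumerate(chunks) if j != i for x in c]
            cv.append(anll(train, test, p, random.Random(seed + p)))
        scores[p] = sum(cv) / folds
    return min(scores, key=scores.get), scores

# Example: select p over a small grid on 2-D synthetic data.
gen = random.Random(1)
data = [[gen.random(), gen.random()] for _ in range(150)]
best_p, cv_scores = select_depth(data, depths=[1, 2, 3])
print(best_p)
```

The paper's grid runs to p = 15 (2^15 cells), where the linear-scan `locate` above would be replaced by descending the tree; the scan is kept here only to keep the sketch short.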