When do random forests fail?

Authors: Cheng Tang, Damien Garreau, Ulrike von Luxburg

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | To facilitate our theoretical investigation, we restrict our analysis to unsupervised random forests, that is, random forests whose tree construction does not use label information (Def. 2). We establish two conditions, diversity and locality (Def. 3 and 4), that are necessary for a forest to be consistent. We then examine how parameter tuning affects diversity and locality. Our results highlight the importance of subsampling data points during the tree construction phase: without subsampling, forests of deep trees can become inconsistent due to a violation of diversity; on the other hand, if we subsample too heavily, forests can also become inconsistent due to a violation of locality. Our analysis implies two surprising consequences as special cases: (1) When considering partitioning trees that are particularly good for nearest-neighbor search, such as random projection trees, it is natural to expect them also to be good for random forests. Our results disagree with this intuition: unless we use severe subsampling, they lead to inconsistent forests. (2) In a popular variant of random forests, extremely randomized trees are used and subsampling is disabled (Geurts et al., 2006). The argument in that paper is that when forests use extremely randomized trees, the randomness in the trees already reduces variance, and thus subsampling becomes unnecessary. Our results suggest otherwise.
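A minimal Python sketch (ours, not the authors' code) of where the subsampling parameter enters the forest construction analyzed above: each tree is grown on an independent subsample of size s drawn from the n data points, and s = n corresponds to disabling subsampling. The names build_forest and grow_tree are hypothetical.

import numpy as np

def build_forest(X, n_trees, s, grow_tree, rng=np.random.default_rng(0)):
    """Grow each tree on its own subsample of size s (without replacement).

    The paper's point: too large an s can violate diversity, too small an s
    can violate locality, and either failure makes the forest inconsistent.
    """
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=s, replace=False)  # subsample before growing the tree
        forest.append(grow_tree(X[idx]))
    return forest

For instance, build_forest(X, n_trees=100, s=len(X) // 2, grow_tree=lambda S: rpt(S, n0=10)) combines this with the projection-tree sketch given after the pseudocode below.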
Researcher Affiliation | Academia | Cheng Tang, George Washington University, Washington, DC, tangch@gwu.edu; Damien Garreau, Max Planck Institute for Intelligent Systems, Tübingen, Germany, damien.garreau@tuebingen.mpg.de; Ulrike von Luxburg, University of Tübingen and Max Planck Institute for Intelligent Systems, Tübingen, Germany, luxburg@informatik.uni-tuebingen.de
Pseudocode | Yes |
Algorithm 1 Randomized Projection Tree
Input: sample S, maximum leaf size n0; Output: T = RPT(S, n0)
 1: T ← empty tree
 2: if |S| > n0 then
 3:   Sample U uniformly from S^(d−1)
 4:   Sample q uniformly from [1/4, 3/4]
 5:   t_q ← empirical q-th quantile of U^T S
 6:   S_L ← {x ∈ S : U^T x ≤ t_q}
 7:   T.graft(RPT(S_L, n0))
 8:   S_R ← S \ S_L
 9:   T.graft(RPT(S_R, n0))
10: end if

Algorithm 2 Randomized Spill Tree
Input: sample S, maximum leaf size n0, overlap α ∈ (0, 1/2); Output: T = RST(S, n0, α)
 1: T ← empty tree
 2: if |S| > n0 then
 3:   Sample U uniformly from S^(d−1)
 4:   t_L ← top (1/2 + α)-quantile of U^T S
 5:   t_R ← bottom (1/2 + α)-quantile of U^T S
 6:   S_L ← {x ∈ S : U^T x ≤ t_L}
 7:   T.graft(RST(S_L, n0, α))
 8:   S_R ← {x ∈ S : U^T x ≥ t_R}
 9:   T.graft(RST(S_R, n0, α))
10: end if
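The pseudocode translates directly into Python. Below is a minimal runnable sketch, assuming a nested-dict representation in place of the paper's T.graft operations; rpt, rst, and random_direction are our names, and the split range [1/4, 3/4] follows the reconstruction of Algorithm 1 above.

import numpy as np

rng = np.random.default_rng(0)

def random_direction(d):
    # Sample U uniformly from the unit sphere S^(d-1).
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def rpt(S, n0):
    # Randomized projection tree (Algorithm 1); S is an (n, d) array.
    if len(S) <= n0:
        return {"leaf": S}                     # stop once the leaf size is reached
    u = random_direction(S.shape[1])
    q = rng.uniform(0.25, 0.75)                # q drawn uniformly from [1/4, 3/4]
    proj = S @ u
    t_q = np.quantile(proj, q)                 # empirical q-th quantile of U^T S
    left = proj <= t_q
    return {"u": u, "t": t_q,
            "left": rpt(S[left], n0),          # S_L
            "right": rpt(S[~left], n0)}        # S_R = S \ S_L

def rst(S, n0, alpha):
    # Randomized spill tree (Algorithm 2); children overlap in a 2*alpha fraction.
    if len(S) <= n0:
        return {"leaf": S}
    u = random_direction(S.shape[1])
    proj = S @ u
    t_l = np.quantile(proj, 0.5 + alpha)       # top (1/2 + alpha)-quantile
    t_r = np.quantile(proj, 0.5 - alpha)       # bottom (1/2 + alpha)-quantile
    return {"u": u,
            "left": rst(S[proj <= t_l], n0, alpha),
            "right": rst(S[proj >= t_r], n0, alpha)}

tree = rpt(rng.standard_normal((200, 5)), n0=10)
spill = rst(rng.standard_normal((200, 5)), n0=10, alpha=0.1)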
Open Source Code | No | The paper does not include any statement or link indicating that source code for the methodology is openly available.
Open Datasets | No | The paper is theoretical and does not describe experiments using specific, named public datasets with access information.
Dataset Splits | No | The paper is theoretical and does not discuss training, validation, or test splits that would need to be reproduced.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies or versions.
Experiment Setup | No | The paper is theoretical and does not provide specific experimental setup details such as hyperparameters or training configurations.