When do random forests fail?

Authors: Cheng Tang, Damien Garreau, Ulrike von Luxburg

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | To facilitate our theoretical investigation, we restrict our analysis to unsupervised random forests, that is, random forests whose tree construction does not use label information (Def. 2). We establish two conditions, diversity and locality (Def. 3 and 4), that are necessary for a forest to be consistent. We then examine how parameter tuning affects diversity and locality. Our results highlight the importance of subsampling data points during the tree construction phase: without subsampling, forests of deep trees can become inconsistent due to a violation of diversity; on the other hand, if we subsample too heavily, forests can also become inconsistent due to a violation of locality. Our analysis implies two surprising consequences as special cases: (1) When considering partitioning trees that are particularly good for nearest-neighbor search, such as random projection trees, it is natural to expect them also to be good for random forests. Our results disagree with this intuition: unless we use severe subsampling, they lead to inconsistent forests. (2) In a popular variant of random forests, extremely randomized trees are used and subsampling is disabled (Geurts et al., 2006). The argument in that paper is that when forests use extremely randomized trees, the randomness in the trees already reduces variance, and thus subsampling becomes unnecessary. Our results suggest otherwise.
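A minimal Python sketch (ours, not the authors' code) of where the subsampling parameter enters the forest construction analyzed above: each tree is grown on an independent subsample of size s drawn from the n data points, and s = n corresponds to disabling subsampling. The names build_forest and grow_tree are hypothetical.

import numpy as np

def build_forest(X, n_trees, s, grow_tree, rng=np.random.default_rng(0)):
    """Grow each tree on its own subsample of size s (without replacement).

    The paper's point: too large an s can violate diversity, too small an s
    can violate locality, and either failure makes the forest inconsistent.
    """
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=s, replace=False)  # subsample before growing the tree
        forest.append(grow_tree(X[idx]))
    return forest

For instance, build_forest(X, n_trees=100, s=len(X) // 2, grow_tree=lambda S: rpt(S, n0=10)) combines this with the projection-tree sketch given after the pseudocode below.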
Researcher Affiliation | Academia | Cheng Tang, George Washington University, Washington, DC, tangch@gwu.edu; Damien Garreau, Max Planck Institute for Intelligent Systems, Tübingen, Germany, damien.garreau@tuebingen.mpg.de; Ulrike von Luxburg, University of Tübingen and Max Planck Institute for Intelligent Systems, Tübingen, Germany, luxburg@informatik.uni-tuebingen.de
Pseudocode | Yes |
Algorithm 1 Randomized Projection Tree
Input: sample S, maximum leaf size n0; Output: T = RPT(S, n0)
 1: T ← empty tree
 2: if |S| > n0 then
 3:   Sample U uniformly from S^(d−1)
 4:   Sample q uniformly from [1/4, 3/4]
 5:   t_q ← empirical q-th quantile of U^T S
 6:   S_L ← {x ∈ S : U^T x ≤ t_q}
 7:   T.graft(RPT(S_L, n0))
 8:   S_R ← S \ S_L
 9:   T.graft(RPT(S_R, n0))
10: end if

Algorithm 2 Randomized Spill Tree
Input: sample S, maximum leaf size n0, overlap α ∈ (0, 1/2); Output: T = RST(S, n0, α)
 1: T ← empty tree
 2: if |S| > n0 then
 3:   Sample U uniformly from S^(d−1)
 4:   t_L ← top (1/2 + α)-quantile of U^T S
 5:   t_R ← bottom (1/2 + α)-quantile of U^T S
 6:   S_L ← {x ∈ S : U^T x ≤ t_L}
 7:   T.graft(RST(S_L, n0, α))
 8:   S_R ← {x ∈ S : U^T x ≥ t_R}
 9:   T.graft(RST(S_R, n0, α))
10: end if
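The pseudocode translates directly into Python. Below is a minimal runnable sketch, assuming a nested-dict representation in place of the paper's T.graft operations; rpt, rst, and random_direction are our names, and the split range [1/4, 3/4] follows the reconstruction of Algorithm 1 above.

import numpy as np

rng = np.random.default_rng(0)

def random_direction(d):
    # Sample U uniformly from the unit sphere S^(d-1).
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def rpt(S, n0):
    # Randomized projection tree (Algorithm 1); S is an (n, d) array.
    if len(S) <= n0:
        return {"leaf": S}                     # stop once the leaf size is reached
    u = random_direction(S.shape[1])
    q = rng.uniform(0.25, 0.75)                # q drawn uniformly from [1/4, 3/4]
    proj = S @ u
    t_q = np.quantile(proj, q)                 # empirical q-th quantile of U^T S
    left = proj <= t_q
    return {"u": u, "t": t_q,
            "left": rpt(S[left], n0),          # S_L
            "right": rpt(S[~left], n0)}        # S_R = S \ S_L

def rst(S, n0, alpha):
    # Randomized spill tree (Algorithm 2); children overlap in a 2*alpha fraction.
    if len(S) <= n0:
        return {"leaf": S}
    u = random_direction(S.shape[1])
    proj = S @ u
    t_l = np.quantile(proj, 0.5 + alpha)       # top (1/2 + alpha)-quantile
    t_r = np.quantile(proj, 0.5 - alpha)       # bottom (1/2 + alpha)-quantile
    return {"u": u,
            "left": rst(S[proj <= t_l], n0, alpha),
            "right": rst(S[proj >= t_r], n0, alpha)}

tree = rpt(rng.standard_normal((200, 5)), n0=10)
spill = rst(rng.standard_normal((200, 5)), n0=10, alpha=0.1)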
Open Source Code | No | The paper does not include any statement or link indicating that source code for the methodology is openly available.
Open Datasets | No | The paper is theoretical and does not describe experiments using specific, named public datasets with access information.
Dataset Splits | No | The paper is theoretical and does not discuss training, validation, or test splits that would need to be reproduced.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies or versions.
Experiment Setup | No | The paper is theoretical and does not provide specific experimental setup details such as hyperparameters or training configurations.