Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Complete Search for Feature Selection in Decision Trees

Authors: Salvatore Ruggieri

JMLR 2019 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate experimentally the properties and limitations of the procedures on a collection of 20 benchmark datasets, showing that oversearching increases both overfitting and instability.
Researcher Affiliation | Academia | Salvatore Ruggieri, EMAIL, Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
Pseudocode | Yes | Algorithm 1 subset(R, S) enumerates R ⊎ Pow(S). 1: output R ∪ S … 4: for ai ∈ S do 5: R ← R \ {ai} 6: subset(R, S) 7: S ← S ∪ {ai}. Algorithm 2 DTdistinct(R, S) enumerates distinct decision trees using feature subsets in R ⊎ Pow(S).
Open Source Code | Yes | Both DTacceptδ and DTsbe are implemented in a multi-core data parallel C++ system, which is made publicly available.
Open Datasets | Yes | We perform experiments on 20 small and large dimensional benchmark datasets publicly available from the UCI ML repository (Lichman, 2013).
Dataset Splits | Yes | Cross-validation is repeated 10 times. At each repetition, the available dataset is split into 10 folds using stratified random sampling. Each fold is used to compute the misclassification error of the classifier built on the remaining 9 folds, which serve as the training set. The generalization error is then the average misclassification error over the 100 classification models (10 models times 10 repetitions). The training set is further split into a 70% building set and a 30% search set using stratified random sampling.
Hardware Specification | Yes | Tests were performed on a PC with an 8-core Intel i7-6900K at 3.7 GHz, without hyperthreading, 16 GB RAM, and Windows Server 2016 OS.
Software Dependencies | No | All procedures described in this paper are implemented by extending the YaDT system (Ruggieri, 2002, 2004; Aldinucci et al., 2014). It is a state-of-the-art main-memory C++ implementation of C4.5 with many algorithmic and data structure optimizations, as well as multi-core data parallelism in tree building. The extended YaDT version is publicly available from the author's home page.
Experiment Setup | Yes | Information Gain (IG) is used as the quality measure in node splitting during tree construction. For all datasets, the tree-building stopping parameter m is set to the small value 2. RF: a random forest of 100 decision trees. OCT parameters: 8 core processes in Julia, max depth = 5, other parameters set to default (min bucket = 1, local search = true, cp = 0.01).
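The Pseudocode row quotes Algorithm 1, which enumerates every feature subset of the form R ∪ T with T ⊆ S (i.e., R joined with each element of Pow(S)). As a hedged illustration of that kind of recursive enumeration — an illustrative sketch, not the paper's actual algorithm — one might write:

```python
def enumerate_subsets(R, S):
    """Yield R ∪ T for every subset T of S (illustrative sketch only).

    Each recursive call fixes one element of S into R and recurses on the
    remaining elements, so every subset is emitted exactly once.
    """
    yield frozenset(R)
    S = list(S)
    for i, a in enumerate(S):
        # Fix element a into R; only elements after a remain selectable,
        # which avoids generating the same subset twice.
        yield from enumerate_subsets(frozenset(R) | {a}, S[i + 1:])
```

For a feature set S of size n with R = ∅, this yields all 2^n subsets, each exactly once.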
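The Dataset Splits row describes 10×10-fold stratified cross-validation, with each training set further split 70/30 into building and search sets. A minimal sketch of the stratified splitting steps (an illustrative re-implementation using only the standard library, not the paper's YaDT code; function names are ours) could look like:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Partition sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    return folds

def building_search_split(train_idx, labels, frac=0.7, seed=0):
    """Split a training set into building/search sets, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    building, search = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(frac * len(idxs))
        building.extend(idxs[:cut])
        search.extend(idxs[cut:])
    return building, search
```

In the protocol above, each of the 10 repetitions would call `stratified_folds` once, hold out each fold in turn as the test set, and apply `building_search_split` to the remaining nine folds, giving the 100 classification models whose errors are averaged.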
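The Experiment Setup row names Information Gain (IG) as the node-splitting quality measure. As a reminder of what that measure computes — a generic textbook sketch, not the paper's C4.5/YaDT implementation — IG of a binary split is the parent entropy minus the size-weighted entropy of the two child partitions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """IG of splitting `parent` into the partitions `left` and `right`."""
    n = len(parent)
    weighted_child = (len(left) / n * entropy(left)
                      + len(right) / n * entropy(right))
    return entropy(parent) - weighted_child
```

For example, splitting a balanced binary-labeled node into two pure children yields the maximum gain of 1 bit.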