Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Complete Search for Feature Selection in Decision Trees

Authors: Salvatore Ruggieri

JMLR 2019 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate experimentally the properties and limitations of the procedures on a collection of 20 benchmark datasets, showing that oversearching increases both overfitting and instability.
Researcher Affiliation | Academia | Salvatore Ruggieri, EMAIL, Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
Pseudocode | Yes | Algorithm 1 subset(R, S) enumerates R ⊎ Pow(S). 1: output R ∪ S … 4: for ai ∈ S do 5: R ← R \ {ai} 6: subset(R, S) 7: S ← S ∪ {ai}. Algorithm 2 DTdistinct(R, S) enumerates distinct decision trees using feature subsets in R ⊎ Pow(S).
Open Source Code | Yes | Both DTacceptδ and DTsbe are implemented in a multi-core data parallel C++ system, which is made publicly available.
Open Datasets | Yes | We perform experiments on 20 small and large dimensional benchmark datasets publicly available from the UCI ML repository (Lichman, 2013).
Dataset Splits | Yes | Cross-validation is repeated 10 times. At each repetition, the available dataset is split into 10 folds using stratified random sampling. Each fold is used to compute the misclassification error of the classifier built on the remaining 9 folds, which serve as the training set. The generalization error is then the average misclassification error over the 100 classification models (10 models times 10 repetitions). The training set is further split into a 70% building set and a 30% search set using stratified random sampling.
Hardware Specification | Yes | Tests were performed on a PC with an 8-core Intel i7-6900K at 3.7 GHz, without hyperthreading, 16 GB RAM, and Windows Server 2016 OS.
Software Dependencies | No | All procedures described in this paper are implemented by extending the YaDT system (Ruggieri, 2002, 2004; Aldinucci et al., 2014). It is a state-of-the-art main-memory C++ implementation of C4.5 with many algorithmic and data structure optimizations, as well as multi-core data parallelism in tree building. The extended YaDT version is publicly available from the author's home page.
Experiment Setup | Yes | Information Gain (IG) is used as the quality measure in node splitting during tree construction. For all datasets, the tree-building stopping parameter m is set to the small value 2. RF: a random forest of 100 decision trees. OCT parameters: 8 core processes in Julia, max depth = 5, other parameters set to default (min bucket = 1, local search = true, cp = 0.01).
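The Pseudocode row quotes Algorithm 1, which enumerates every feature subset of the form R ∪ T with T ⊆ S (i.e., R joined with each element of Pow(S)). As a hedged illustration of that kind of recursive enumeration — an illustrative sketch, not the paper's actual algorithm — one might write:

```python
def enumerate_subsets(R, S):
    """Yield R ∪ T for every subset T of S (illustrative sketch only).

    Each recursive call fixes one element of S into R and recurses on the
    remaining elements, so every subset is emitted exactly once.
    """
    yield frozenset(R)
    S = list(S)
    for i, a in enumerate(S):
        # Fix element a into R; only elements after a remain selectable,
        # which avoids generating the same subset twice.
        yield from enumerate_subsets(frozenset(R) | {a}, S[i + 1:])
```

For a feature set S of size n with R = ∅, this yields all 2^n subsets, each exactly once.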
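The Dataset Splits row describes 10×10-fold stratified cross-validation, with each training set further split 70/30 into building and search sets. A minimal sketch of the stratified splitting steps (an illustrative re-implementation using only the standard library, not the paper's YaDT code; function names are ours) could look like:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Partition sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    return folds

def building_search_split(train_idx, labels, frac=0.7, seed=0):
    """Split a training set into building/search sets, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    building, search = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(frac * len(idxs))
        building.extend(idxs[:cut])
        search.extend(idxs[cut:])
    return building, search
```

In the protocol above, each of the 10 repetitions would call `stratified_folds` once, hold out each fold in turn as the test set, and apply `building_search_split` to the remaining nine folds, giving the 100 classification models whose errors are averaged.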
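The Experiment Setup row names Information Gain (IG) as the node-splitting quality measure. As a reminder of what that measure computes — a generic textbook sketch, not the paper's C4.5/YaDT implementation — IG of a binary split is the parent entropy minus the size-weighted entropy of the two child partitions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """IG of splitting `parent` into the partitions `left` and `right`."""
    n = len(parent)
    weighted_child = (len(left) / n * entropy(left)
                      + len(right) / n * entropy(right))
    return entropy(parent) - weighted_child
```

For example, splitting a balanced binary-labeled node into two pure children yields the maximum gain of 1 bit.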