Accurate Robust and Efficient Error Estimation for Decision Trees

Authors: Lixin Fan

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that the proposed error estimate is superior to the well-known K-fold cross validation methods in terms of robustness and accuracy. Moreover, it is orders of magnitude more efficient than cross validation methods.
Researcher Affiliation | Industry | Lixin Fan LIXIN.FAN@NOKIA.COM Nokia Technologies, Valtatie 30, Tampere, Finland
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described, such as a repository link or an explicit code-release statement.
Open Datasets | Yes | Table 4.1 below summarizes ten benchmark datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Each dataset is used to train a decision tree classifier using an optimised version of the CART algorithm (Pedregosa et al., 2011). Corresponding generalization errors are measured/estimated using four benchmarked approaches... (A training sketch follows this table.)
Dataset Splits | Yes | The whole set of data samples is randomly separated into training and testing subsets, and the separation repeats multiple times to average out generalization errors. More specifically, let the separation ratio r equal #training samples / #dataset samples; for each fixed ratio r = {0.1, 0.3, 0.5, 0.7, 0.9}, a dataset is randomly separated 50 times, so altogether there are 250 random separations. Then, for each separation, a decision tree is constructed using the training samples (see the explanation below concerning cross validation methods). Finally, for each constructed tree, generalization errors are measured (or estimated) at tree nodes with depths ranging from 0 to the maximal depth, where depth 0 corresponds to the root node. For K-fold cross validation, the subset of training samples is randomly separated into CV training and CV validation subsets. Decision trees are constructed using the CV training samples and generalization errors are estimated using the CV validation samples. The process repeats K times to average out the generalization error, where K = {2, 5, 10} in our experiments. (A protocol sketch follows this table.)
Hardware Specification | No | The paper does not provide any specific hardware details (such as GPU or CPU models, or cloud resources with specifications) used for running its experiments.
Software Dependencies | No | The paper mentions 'an optimised version of the CART algorithm (Pedregosa et al., 2011)', which refers to the scikit-learn library, but it does not specify any software versions for this or other dependencies.
Experiment Setup | No | The paper describes dataset splitting protocols (e.g., separation ratio r, K-fold cross-validation), but it does not specify concrete hyperparameters for the decision tree learning algorithm itself (e.g., maximum depth, minimum samples per leaf) or other specific training configurations.
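
To make the Open Datasets and Software Dependencies rows concrete, the following is a minimal sketch of the training setup: scikit-learn's DecisionTreeClassifier is the "optimised version of the CART algorithm (Pedregosa et al., 2011)" the paper refers to. The dataset used here (scikit-learn's bundled copy of the UCI Breast Cancer Wisconsin data), the library-default hyperparameters, the single separation ratio, and the fixed random seed are all assumptions; the paper names ten UCI datasets but reports no tree hyperparameters.

```python
# Minimal sketch, not the paper's code. Assumptions: dataset choice (UCI Breast
# Cancer Wisconsin, bundled with scikit-learn), default tree hyperparameters,
# separation ratio r = 0.5, and the random seed.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# separation ratio r = #training samples / #dataset samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0, stratify=y
)

# scikit-learn's DecisionTreeClassifier implements the "optimised version of the
# CART algorithm (Pedregosa et al., 2011)" cited in the paper; hyperparameters
# are left at library defaults because the paper does not report them.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

test_error = 1.0 - tree.score(X_test, y_test)  # measured generalization error
print(f"held-out error: {test_error:.3f}, tree depth: {tree.get_depth()}")
```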
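Similarly, the splitting protocol quoted in the Dataset Splits row can be sketched as below: five separation ratios times 50 random separations (250 runs in total), with K-fold cross-validation estimates (K = 2, 5, 10) computed on the training subset only. The per-depth error measurement at individual tree nodes is omitted for brevity, and the dataset, seeds, and default hyperparameters are again assumptions rather than details taken from the paper.

```python
# Sketch of the evaluation protocol quoted above: 5 separation ratios x 50 random
# separations = 250 runs, plus K-fold CV (K = 2, 5, 10) computed on the training
# subset alone. Per-node/per-depth error measurement is omitted; dataset, seeds,
# and default hyperparameters are assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for r in (0.1, 0.3, 0.5, 0.7, 0.9):               # separation ratios from the paper
    test_errors = []
    cv_errors = {k: [] for k in (2, 5, 10)}
    for rep in range(50):                          # 50 random separations per ratio
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=r, random_state=rep, stratify=y
        )
        tree = DecisionTreeClassifier(random_state=rep).fit(X_tr, y_tr)
        test_errors.append(1.0 - tree.score(X_te, y_te))   # measured error

        # K-fold CV estimates use only the training subset, as described above
        for k in (2, 5, 10):
            scores = cross_val_score(
                DecisionTreeClassifier(random_state=rep), X_tr, y_tr, cv=k
            )
            cv_errors[k].append(1.0 - scores.mean())

    summary = ", ".join(f"{k}-fold CV {np.mean(cv_errors[k]):.3f}" for k in (2, 5, 10))
    print(f"r={r}: measured error {np.mean(test_errors):.3f}; {summary}")
```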