Binary Partitions with Approximate Minimum Impurity

Authors: Eduardo Laber, Marco Molinaro, Felipe Mello Pereira

ICML 2018

Reproducibility Variable Result LLM Response
Research Type Experimental We also report experiments that provide evidence that the proposed methods are interesting candidates to be employed in splitting nominal attributes with many values during decision tree/random forest induction. To complement our theoretical findings, in Section 5 we present a set of experiments where we compare the proposed methods with PCext and SLIQext.
Researcher Affiliation Academia 1Departamento de Informática, PUC-Rio, Brazil. Correspondence to: Eduardo Laber <eduardo.laber1@gmail.com>.
Pseudocode Yes Algorithm 1 Bd (V: collection of vectors, I: impurity measure)
1: For each v in V let r(v) = (d · v)/‖v‖₁
2: Rank the vectors in V according to r(v)
3: for j = 1, . . . , n − 1 do
4:   Pj ← subset of V containing the j vectors with the largest value of r(·)
5:   Evaluate the impurity of the partition (Pj, V \ Pj)
6: end for
7: Return the partition (Pj*, V \ Pj*) with the smallest impurity found in the loop
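The rank-and-sweep procedure above can be sketched in Python. This is an illustrative reconstruction, not the authors' implementation: the direction vector `d`, the `impurity` callback, and the toy weighted-Gini measure used in the example are all assumptions made here for demonstration.

```python
import numpy as np

def bd_partition(V, d, impurity):
    """Sketch of Algorithm 1 (Bd): score each class-count vector v by
    r(v) = (d . v) / ||v||_1, rank the vectors by r, and try every prefix
    of the ranking as one side of a binary partition, keeping the split
    with the smallest impurity. Vectors are assumed non-zero."""
    V = [np.asarray(v, dtype=float) for v in V]
    r = [np.dot(d, v) / np.abs(v).sum() for v in V]     # step 1
    order = sorted(range(len(V)), key=lambda i: -r[i])  # step 2: rank, descending
    best = None
    for j in range(1, len(V)):                          # steps 3-6
        left = [V[i] for i in order[:j]]
        right = [V[i] for i in order[j:]]
        imp = impurity(left, right)
        if best is None or imp < best[0]:
            best = (imp, left, right)
    return best[1], best[2]                             # step 7

# A toy impurity for the demo: size-weighted Gini of each side
# (an assumption; the paper covers a family of impurity measures).
def gini_side(vectors):
    total = np.sum(vectors, axis=0)  # per-class counts of this side
    n = total.sum()
    p = total / n
    return n * (1.0 - np.sum(p ** 2))

def weighted_gini(left, right):
    return gini_side(left) + gini_side(right)
```

For instance, with two-class count vectors `[[8, 1], [1, 9], [7, 2], [2, 6]]` and direction `d = [1, -1]`, the sweep groups the class-0-heavy vectors on one side and the class-1-heavy vectors on the other.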
Open Source Code Yes The algorithms were implemented in Python 3 using numpy and are available in https://github.com/felipeamp/icml-2018.
Open Datasets Yes We employed 11 datasets in total. Eight of them are from the UCI repository: Mushroom, KDD98, Adult, Nursery, Covertype, Cars, Contraceptive and Poker (Lichman, 2013). Two others are available in Kaggle: San Francisco Crime and Shelter Animal Outcome (SF-Open Data; Austin Animal-Center).
Dataset Splits Yes We employed a 95% one-tailed paired Student's t-test to compare the accuracy attained by the methods over 20 3-fold stratified cross-validations.
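The paired comparison described here can be sketched in pure Python. The function name, the example accuracy values, and the hardcoded critical value are assumptions for illustration; 20 repetitions of 3-fold CV yield 60 paired scores, and the one-tailed 5% critical value for df = 59 is roughly 1.671.

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired test: a[i] and b[i] are accuracies of the
    two methods on the same cross-validation fold."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Method A is declared significantly better than B at the 95% level when
# the statistic exceeds the one-tailed critical value (~1.671 for df = 59).
```

With toy fold accuracies `a = [0.91, 0.90, 0.92, 0.89]` and `b = [0.88, 0.89, 0.90, 0.87]`, the statistic comes out near 4.9, well above the threshold.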
Hardware Specification Yes All the experiments were executed on a PC Intel i7-6500U CPU with 2.5GHz and 8GB of RAM.
Software Dependencies No The algorithms were implemented in Python 3 using numpy and are available in https://github.com/felipeamp/icml-2018. While Python 3 is mentioned, a specific version for numpy is not provided.
Experiment Setup Yes All experiments are Monte Carlo simulations with 10,000 runs, each using a randomly-generated contingency table for the given number of values n and classes k. Each table was created by uniformly picking a number in {0, . . . , 7} for each entry. We build decision trees with depth at most 16.
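The table-generation step of the Monte Carlo setup can be sketched as follows; the function name and seeding interface are illustrative assumptions, not the authors' code.

```python
import random

def random_contingency_table(n, k, rng=None):
    """One Monte Carlo trial's input: an n x k contingency table
    (n attribute values, k classes) with each entry drawn uniformly
    from {0, ..., 7}, as in the setup described above."""
    rng = rng or random.Random()
    return [[rng.randint(0, 7) for _ in range(k)] for _ in range(n)]
```

A full experiment would call this 10,000 times for each (n, k) pair and run the competing splitting methods on every table.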