Binary Partitions with Approximate Minimum Impurity
Authors: Eduardo Laber, Marco Molinaro, Felipe Mello Pereira
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also report experiments that provide evidence that the proposed methods are interesting candidates to be employed in splitting nominal attributes with many values during decision tree/random forest induction. To complement our theoretical findings, in Section 5 we present a set of experiments where we compare the proposed methods with PCext and SLIQext. |
| Researcher Affiliation | Academia | 1Departamento de Informática, PUC-Rio, Brazil. Correspondence to: Eduardo Laber <eduardo.laber1@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Bd(V: collection of vectors, I: impurity measure). 1: For each v in V let r(v) = v_d / ‖v‖₁. 2: Rank the vectors in V according to r(v). 3: for j = 1, …, n−1 do 4: P_j ← subset of V containing the j vectors with the largest value of r(·). 5: Evaluate the impurity of partition (P_j, V \ P_j). 6: end for. 7: Return the partition (P_j, V \ P_j) with the smallest impurity found in the loop. |
| Open Source Code | Yes | The algorithms were implemented in Python 3 using numpy and are available at https://github.com/felipeamp/icml-2018. |
| Open Datasets | Yes | We employed 11 datasets in total. Eight of them are from the UCI repository: Mushroom, KDD98, Adult, Nursery, Covertype, Cars, Contraceptive and Poker (Lichman, 2013). Two others are available in Kaggle: San Francisco Crime and Shelter Animal Outcome (SF-Open Data; Austin Animal-Center). |
| Dataset Splits | Yes | We employed a one-tailed paired Student's t-test at the 95% confidence level to compare the accuracy attained by the methods over 20 runs of 3-fold stratified cross-validation. |
| Hardware Specification | Yes | All the experiments were executed on a PC Intel i7-6500U CPU with 2.5GHz and 8GB of RAM. |
| Software Dependencies | No | The algorithms were implemented in Python 3 using numpy and are available at https://github.com/felipeamp/icml-2018. While Python 3 is mentioned, no specific version of numpy (or of Python itself) is provided. |
| Experiment Setup | Yes | All experiments are Monte Carlo simulations with 10,000 runs, each using a randomly-generated contingency table for the given number of values n and classes k. Each table was created by uniformly picking a number in {0, . . . , 7} for each entry. We build decision trees with depth at most 16. |
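The pseudocode quoted above (Algorithm 1, Bd) can be sketched in Python. This is a minimal illustration, not the authors' released code: it assumes each vector is a class-count row of the contingency table, that `d` indexes a fixed class, and it uses the Gini impurity with count-weighted averaging as one common convention for scoring a partition.

```python
import numpy as np

def gini(p):
    """Gini impurity of a probability vector p."""
    return 1.0 - np.sum(p ** 2)

def partition_impurity(left, right, impurity=gini):
    """Weighted impurity of a binary partition.

    left/right are lists of class-count vectors; each side is weighted by
    its total count (a common convention; the paper's weighting may differ).
    """
    total = 0.0
    score = 0.0
    for side in (left, right):
        counts = np.sum(side, axis=0)
        n = counts.sum()
        total += n
        if n > 0:
            score += n * impurity(counts / n)
    return score / total

def bd(vectors, d, impurity=gini):
    """Sketch of Algorithm 1 (Bd): rank vectors by r(v) = v_d / ||v||_1,
    evaluate the n-1 prefix partitions, and return the best one found."""
    V = [np.asarray(v, dtype=float) for v in vectors]
    ranked = sorted(V, key=lambda v: v[d] / v.sum(), reverse=True)
    best, best_imp = None, np.inf
    for j in range(1, len(ranked)):
        left, right = ranked[:j], ranked[j:]
        imp = partition_impurity(left, right, impurity)
        if imp < best_imp:
            best_imp, best = imp, (left, right)
    return best, best_imp
```

For example, `bd([[5, 1], [1, 5], [4, 2], [0, 6]], d=0)` ranks the four attribute values by their fraction of class 0 and picks the prefix split that groups the class-0-heavy rows together, needing only n−1 impurity evaluations instead of the 2^(n−1) of an exhaustive search.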
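The Monte Carlo setup described in the table row above can be sketched as follows. The function name and seed are illustrative, and the run count is reduced from the paper's 10,000 for brevity; only the sampling scheme (each entry of an n × k contingency table drawn uniformly from {0, …, 7}) comes from the source.

```python
import numpy as np

def random_contingency_table(n, k, rng):
    """One random n x k contingency table (n attribute values, k classes),
    each entry drawn uniformly from {0, ..., 7} as in the paper's setup."""
    return rng.integers(0, 8, size=(n, k))

# One Monte Carlo experiment: draw many random tables and feed each row
# (the class distribution of one attribute value) to the splitting
# methods being compared.
rng = np.random.default_rng(seed=0)
tables = [random_contingency_table(n=10, k=4, rng=rng) for _ in range(1000)]
```

Each generated table stands in for one run of the simulation; the paper aggregates the impurity achieved by each splitting method over all runs.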