Symbolic Regression Enhanced Decision Trees for Classification Tasks

Authors: Kei Sen Fong, Mehul Motani

AAAI 2024

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
"We evaluate SREDT on both synthetic and real-world datasets. Despite its simplicity, our method produces surprisingly small trees that outperform both DT and oblique DT (ODT) on supervised classification tasks in terms of accuracy and F-score."

Researcher Affiliation: Academia
"Kei Sen Fong (1), Mehul Motani (1, 2); (1) Department of Electrical and Computer Engineering, National University of Singapore; (2) N.1 Institute for Health, Institute for Digital Medicine (WisDM), Institute of Data Science, National University of Singapore"

Pseudocode: Yes
"Algorithm 1: Symbolic Regressor Pseudocode
Input: N = set of classified instances, with D features; Criteria Scoring (e.g., Mean Squared Error); population size; generations
Output: Best Individual
    population ← InitializePopulation(population size)
    for gen ← 1 to generations do
        population ← EvaluateFitness(population, N, Criteria Scoring)
        population ← SelectParents(population)
        population ← Crossover(population)
        population ← Mutate(population)
    return BestIndividual(population)"

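The loop in Algorithm 1 is a standard genetic-programming cycle. As a minimal, self-contained Python illustration of that cycle (a toy sketch, not the paper's implementation), the program below evolves arithmetic expressions over a single feature x and scores them with mean squared error, the example Criteria Scoring named in the pseudocode; all operators and the target data are simplified stand-ins.

    import random

    # Primitive operations available at internal nodes of an expression tree.
    OPS = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b}

    def random_expr(depth=2):
        # Leaf: the feature 'x' or a random constant; node: (op, left, right).
        if depth == 0 or random.random() < 0.3:
            return 'x' if random.random() < 0.5 else random.uniform(-1, 1)
        op = random.choice(list(OPS))
        return (op, random_expr(depth - 1), random_expr(depth - 1))

    def evaluate(expr, x):
        if expr == 'x':
            return x
        if isinstance(expr, float):
            return expr
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))

    def mse(expr, data):
        # Criteria Scoring: mean squared error over (x, y) pairs.
        return sum((evaluate(expr, x) - y) ** 2 for x, y in data) / len(data)

    def crossover(a, b):
        # Toy crossover: graft b's right subtree onto a copy of a.
        if isinstance(a, tuple) and isinstance(b, tuple):
            return (a[0], a[1], b[2])
        return a

    def mutate(expr):
        # Toy mutation: replace the individual with a fresh random tree.
        return random_expr()

    def evolve(data, population_size=50, generations=20):
        population = [random_expr() for _ in range(population_size)]
        for _ in range(generations):
            population.sort(key=lambda e: mse(e, data))    # evaluate fitness
            parents = population[:population_size // 2]    # select parents
            children = [crossover(random.choice(parents), random.choice(parents))
                        for _ in range(population_size // 4)]
            mutants = [mutate(random.choice(parents))
                       for _ in range(population_size - len(parents) - len(children))]
            population = parents + children + mutants
        return min(population, key=lambda e: mse(e, data))

    # Fit y = x^2 + 0.5 on 21 points in [-1, 1] and print the best expression.
    data = [(x / 10, (x / 10) ** 2 + 0.5) for x in range(-10, 11)]
    print(evolve(data))
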
Open Source Code: No
The paper neither states that the code for the methodology is open-sourced nor provides a link to a code repository.

Open Datasets: Yes
"We demonstrate through experiments the ability of SREDT to solve 3 public synthetic classification problems from MLxtend (Raschka 2018): XOR, Half-Moons and Concentric Circles. To evaluate the performance of SREDT on real-life datasets, we extensively study 6 commonly used tabular classification datasets: the Cancer, Diabetes, Forest Type Mapping, Heart Disease, Iris and Raisin datasets (Dua and Graff 2019; Smith et al. 1988; Ali 2020). Further, we ran local SREDT along with our original SREDT, ODT and DTs of 2 different depths on 56 different PMLB datasets (Olson et al. 2017; dataset details in Supplementary Appendix D)."

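The three synthetic problems are standard 2-D benchmarks. A short sketch of how comparable data can be generated is below; the paper uses MLxtend (Raschka 2018), so the scikit-learn generators and the sample sizes and noise levels here are assumptions, not the paper's exact settings.

    import numpy as np
    from sklearn.datasets import make_moons, make_circles

    rng = np.random.default_rng(0)

    # XOR: label is positive exactly when the two coordinates share a sign.
    X_xor = rng.uniform(-1, 1, size=(200, 2))
    y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

    # Half-Moons and Concentric Circles via scikit-learn equivalents.
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)
    X_circles, y_circles = make_circles(n_samples=200, noise=0.05,
                                        factor=0.5, random_state=0)
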
Dataset Splits: Yes
"Performance is evaluated on a 60-10-30 random train-validation-test split and the hyper-parameters of the other models are tuned using the validation set."

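The 60-10-30 split can be reproduced with two chained calls to scikit-learn's train_test_split, as sketched below on placeholder data; the seed and the dataset are hypothetical.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

    # First hold out 40% of the data, then split that 40% into
    # 10% validation and 30% test (0.75 * 40% = 30%).
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.40, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.75, random_state=0)
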
Hardware Specification: No
The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.

Software Dependencies: No
The paper mentions software such as Python, MLxtend, PyCaret, AdaBoost and CatBoost, but does not provide specific version numbers for these dependencies.

Experiment Setup: Yes
"We set our SR hyperparameters as follows: the parsimony coefficient to 0.001, number of generations to 40, population size to 400 and tournament size to 200. Gini impurity is used as the Criteria Scoring function. SREDT is given a max depth of log2(n)+1, where n is the number of unique labels."

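These hyperparameters map directly onto common symbolic-regression libraries. A minimal sketch follows, assuming gplearn as the SR backend (the paper does not name its implementation); gplearn's default error metric stands in for the paper's Gini-impurity Criteria Scoring, which would require a custom fitness function (e.g., via gplearn.fitness.make_fitness).

    import numpy as np
    from gplearn.genetic import SymbolicRegressor

    sr = SymbolicRegressor(
        parsimony_coefficient=0.001,  # parsimony coefficient from the paper
        generations=40,               # number of generations
        population_size=400,          # population size
        tournament_size=200,          # tournament size
        random_state=0,               # hypothetical seed, not from the paper
    )

    # SREDT depth limit: log2(n) + 1, where n is the number of unique labels;
    # taking the floor here is one plausible reading of the formula.
    y = np.array([0, 1, 2, 0, 1, 2])                 # hypothetical labels
    max_depth = int(np.log2(len(np.unique(y)))) + 1  # 3 labels -> depth 2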