Speeding up Very Fast Decision Tree with Low Computational Cost

Authors: Jian Sun, Hongyu Jia, Bo Hu, Xiao Huang, Hao Zhang, Hai Wan, Xibin Zhao

IJCAI 2020

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments are conducted using multiple synthetic and real datasets.
Researcher Affiliation Academia (1) KLISS, BNRist, School of Software, Tsinghua University, China; (2) Beijing University of Posts and Telecommunications, China. {sunj17, jiahy19}@mails.tsinghua.edu.cn, huboniccolo@bupt.edu.cn, {huangx19, hao-zhan17}@mails.tsinghua.edu.cn, {wanhai, zxb}@tsinghua.edu.cn
Pseudocode Yes Algorithm 1: Online Decision Tree Induction; Function 2: AttemptToSplit(l, G, X, δ, τ); Algorithm 3: Online Decision Tree Induction with IMAC; Function 4: AttemptToSplitWithIMAC(l, G, X, δ, τ, K)
Open Source Code Yes Our code is available at GitHub: https://github.com/yearsj/IMAC
Open Datasets Yes We use large streams consisting of well known real-world and synthetic datasets. Table 1 shows detailed information. Synthetic data (SEA [Street and Kim, 2001], LED [Breiman et al., 1984], AGR [Agrawal et al., 1993], RTG [Domingos and Hulten, 2000], RBF) are all generated using the API proposed by MOA. [...] Covertype. The forest covertype data set [...] KDD99. KDD99 dataset [...] MNIST8M. MNIST8M is the augmentation of original MNIST [LeCun et al., 1998] database by using pseudorandom deformations and translations [Loosli et al., 2007].
Dataset Splits No The paper uses standard datasets but does not explicitly state the train/validation/test dataset splits, specific percentages, or a cross-validation setup.
Hardware Specification Yes All experiments are conducted on a standard server with 36 cores and 125GB memory.
Software Dependencies No All algorithms and experiments are implemented on the Massive Online Analysis (MOA) platform [Bifet et al., 2010], which is one of the most popular open-source frameworks for data stream mining. No version numbers are provided for MOA or any other software dependency.
Experiment Setup Yes VFDT with default parameters (nmin = 200, τ = 0.05, δ = 1e-7) uses the majority class in leaves for classification and information gain as the heuristic measure. Since nmin is 200, to compare with VFDT and OSM at the same level, parameters µ and η in IMAC are both set to 200.
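
To make the role of these defaults concrete, the sketch below shows the standard Hoeffding-bound split test used by VFDT with information gain. It is a minimal illustration, not the paper's IMAC procedure and not MOA code: the function names `hoeffding_bound` and `should_split` and the example gain values are assumptions introduced here, and the range of information gain is taken as log2 of the number of classes.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for n observations of a variable with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, n, delta=1e-7, tau=0.05):
    """Hypothetical helper: standard VFDT split decision at a leaf.

    Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon has shrunk below the tie threshold tau.
    """
    value_range = math.log2(n_classes)  # range of information gain (assumption stated above)
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tau


# With the defaults quoted above and two classes, the bound after the
# first nmin = 200 instances is roughly 0.2.
eps_at_nmin = hoeffding_bound(1.0, 1e-7, 200)
print(round(eps_at_nmin, 4))

# Illustrative gain values only; they are not taken from the paper.
print(should_split(0.30, 0.05, n_classes=2, n=200))   # clear winner
print(should_split(0.30, 0.25, n_classes=2, n=200))   # too close to call yet
print(should_split(0.30, 0.25, n_classes=2, n=4000))  # tie broken via tau
```

Under these settings a leaf is re-evaluated every nmin = 200 instances, so a gap between the top two attributes that is smaller than the bound postpones the split until more data arrives or the bound falls below τ.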