Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Making Decision Trees Feasible in Ultrahigh Feature and Label Dimensions

Authors: Weiwei Liu, Ivor W. Tsang

JMLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive empirical studies verify that 1) SBT is easy to understand and interpret in ultrahigh dimensions and is more resilient to noisy features. 2) Compared with state-of-the-art algorithms, our proposed sparse coding tree framework is more efficient, yet accurate in ultrahigh label and feature dimensions."
Researcher Affiliation | Academia | Weiwei Liu (Centre for Artificial Intelligence, FEIT, University of Technology Sydney, NSW 2007, Australia; and School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW 2052, Australia); Ivor W. Tsang (Centre for Artificial Intelligence, FEIT, University of Technology Sydney, NSW 2007, Australia)
Pseudocode | Yes |
Algorithm 1: Cutting Plane Algorithm for Solving Problem (7)
Input: data set {x_i, y_i}, i = 1, …, n. Initialize α_0.
1: t = 0, C_t = ∅.
2: repeat
3: t = t + 1.
4: Worst-case analysis: find the most violated ϱ_t based on α_{t−1} and set C_t = C_{t−1} ∪ {ϱ_t}. The details are given in Section 4.1.1.
5: Master-problem optimization: solve problem (8) over C_t and obtain the optimal solution α_t. The details are given in Section 4.1.2.
6: until ϵ-optimal

Algorithm 2: Supervised Budgeted Tree (SBT)
Input: training data (x_1, y_1), …, (x_n, y_n), where x_i ∈ R^m and y_i ∈ {±1}.
Output: a supervised budgeted tree.
1: while the BAC contains more than one class do
2: train the BAC on this node.
3: split the training data of this node into the left child if the instances are predicted to be positive by the BAC, and call Algorithm 2 with the split data as input.
4: split the rest of the training data of this node into the right child and call Algorithm 2 with the remaining data as input.
5: end while
6: return the complete supervised budgeted tree.

Algorithm 3: Powerset Tree (PT) for Label Powerset Annotation
Input: ϖ transformed classes S, corresponding frequencies P, and T_{s_i} (i = 1, …, ϖ).
Output: powerset tree.
1: initialize a priority queue Q = ∅ in ascending order of frequency.
2: create a leaf node for each transformed class s_i and insert it into Q.
3: while |Q| > 1 do
4: remove the two nodes s_i and s_j with the lowest frequency from Q.
5: create a new internal node s_k with s_i and s_j as children; its frequency equals the sum of the two nodes' frequencies.
6: set the instances T_{s_i} as positive samples and T_{s_j} as negative samples, then construct a BAC for node s_k.
7: insert the new node s_k into Q.
8: end while
9: the remaining node is the root node and the tree is complete.

Algorithm 4: Annotation Tree (AT) for Multi-label Annotation
Input: node g and its corresponding G_g, T_g and L_g.
Output: annotation tree.
1: compute Zero_g, One_g and W_g.
2: while W_g ≠ ∅ do
3: select the label l_g ∈ W_g with the highest frequency.
4: build a BAC for l_g.
5: set the label set of the left and right children to W_g \ l_g.
6: split the training instances of node g into the left child g_1 if the value of label l_g of these instances is 1, and call Algorithm 4 with input (W_g \ l_g, T_{g_1}).
7: split the rest of the training instances of node g into the right child g_0 and call Algorithm 4 with input (W_g \ l_g, T_{g_0}).
8: end while
9: the tree is complete.
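The bottom-up merge in Algorithm 3 is a Huffman-style construction over class frequencies. A minimal sketch in Python, assuming a plain dict of class frequencies as input; training the binary classifier (BAC) at each internal node is abstracted away, so this only builds the tree shape:

```python
import heapq
import itertools

def build_powerset_tree(frequencies):
    """Huffman-style sketch of Algorithm 3: repeatedly merge the two
    lowest-frequency nodes until one root remains. `frequencies` maps
    each transformed class to its frequency."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(freq, next(counter), {"class": c, "freq": freq, "children": None})
            for c, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f_i, _, node_i = heapq.heappop(heap)   # lowest frequency
        f_j, _, node_j = heapq.heappop(heap)   # second lowest
        # A BAC separating node_i (positive) from node_j (negative)
        # would be trained here in the full algorithm.
        merged = {"class": None, "freq": f_i + f_j,
                  "children": (node_i, node_j)}
        heapq.heappush(heap, (merged["freq"], next(counter), merged))
    return heap[0][2]  # root node
```

The priority queue here plays the role of Q in the pseudocode; rarer classes end up deeper in the tree, which is what makes frequent classes cheap to predict at test time.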
Open Source Code | No | "The codes of other baseline methods are provided by their authors."
Open Datasets | Yes | "Most data sets are collected from the LIBSVM website [2]. The pcmac data set is from (Xu et al., 2014). The training/testing partition is either predefined or the data is randomly split into 80% training and 20% testing. ... The data sets fall into two categories. The first category contains five small and medium-sized data sets. The second category contains three large-scale data sets with very high label and feature dimensions. The details of the data sets are shown in Table 7. ... collected from websites [4][5]." (Footnote 2: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/; footnote 4: http://mulan.sourceforge.net; footnote 5: http://manikvarma.org/#Julian15a)
Dataset Splits | Yes | "The training/testing partition is either predefined or the data is randomly split into 80% training and 20% testing. ... We use 5-fold cross validation to prune SBT. ... We use the publicly available split of the training and testing sets for very high dimensional data sets and perform 10-fold cross-validations on small and medium-sized data sets and report the mean and standard error of the F1 measure."
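The splitting protocol quoted above (random 80/20 partition, plus k-fold cross-validation) can be sketched with plain index bookkeeping; this is an illustrative sketch, not the paper's own tooling, and the seed handling is an assumption:

```python
import random

def train_test_indices(n, test_frac=0.2, seed=0):
    """Randomly split n example indices into train/test (80/20 by default)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

def kfold_indices(n, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```

With k = 10 this matches the 10-fold protocol used for the small and medium-sized data sets; the per-fold scores would then be averaged to report the mean and standard error of the F1 measure.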
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It discusses computational efficiency and time complexity, but not the underlying hardware.
Software Dependencies | No | "We use the linear classification/regression package LIBLINEAR (Fan et al., 2008) with L2-regularized square hinge loss (primal) to implement standard SVM. ... The codes of other baseline methods are provided by their authors."
Experiment Setup | Yes | "Following the parameter settings in (Tan et al., 2014), B is chosen in a range of {2, 5, 10, 20, 50, 100, 150, 200, 250, 400} for the rcv, news20 and webspam data sets, and {0.01m, 0.02m, …, 0.09m} for the other data sets. C is selected using 5-fold cross validation over the range {0.001, 0.01, 0.1, 5, 10} for the first three data sets and we fix C = 5 for the other data sets. Following the settings in (Oiwa and Fujimaki, 2014), the tree-depth is fixed to 3 in LDKL. ... B is chosen in a range of {1, 3, 5, 10} using cross validation for SBT. ... According to the original setting in (Bhatia et al., 2015), we set the number of clusters as n/6000, the number of learners as 15, and the dimension of the embedding space as 50 for small and medium-sized data sets and 100 for very high dimensional data sets. ... Following (Prabhu and Varma, 2014), T is fixed to 50 for FastXML. ... According to (Tsoumakas et al., 2008), k is chosen in a range of {2, 3, …, 8} using cross validation for Homer."
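The parameter selection described above is a grid search scored by k-fold cross-validation. A minimal sketch under stated assumptions: the `train_fn` callback standing in for "train and evaluate one (B, C) setting" is a hypothetical placeholder, since the paper's own solver is not reproduced here, and only the quoted grids for rcv/news20/webspam are shown:

```python
from itertools import product

def cross_val_score(train_fn, B, C, folds):
    """Mean validation score of one (B, C) setting across the CV folds.
    `train_fn(B, C, train_idx, val_idx)` is a placeholder that trains on
    the train split and returns a validation score."""
    return sum(train_fn(B, C, tr, va) for tr, va in folds) / len(folds)

def select_params(train_fn, B_grid, C_grid, folds):
    """Pick the (B, C) pair maximizing mean cross-validation score."""
    return max(product(B_grid, C_grid),
               key=lambda bc: cross_val_score(train_fn, bc[0], bc[1], folds))

# Grids quoted in the setup for the rcv, news20 and webspam data sets:
B_grid = [2, 5, 10, 20, 50, 100, 150, 200, 250, 400]
C_grid = [0.001, 0.01, 0.1, 5, 10]
```

This makes the cost of the protocol explicit: 10 values of B times 5 values of C times 5 folds means 250 training runs per data set before the final model is fit.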