Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
# Forward-Backward Selection with Early Dropping

Authors: Giorgos Borboudakis, Ioannis Tsamardinos

JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experimental evaluation presented in Section 4, we show that FBEDK is 1-2 orders of magnitude faster than FBS, while at the same time selecting a similar number of features and having similar predictive performance. In a comparison between different members of the FBEDK family and FBS we show that FBED0 and FBED1 also reduce the number of false variable selections, when the data consist of irrelevant variables only. We also investigated the behavior of FBEDK with increasing number of runs K, showing that a relatively small K is sufficient in most cases to reach optimal predictive performance. Afterwards, we compare FBEDK to FBS, feature selection with LASSO (Tibshirani, 1996) and to the Max Min Parents and Children algorithm (MMPC) (Tsamardinos et al., 2003a) and show that it often has comparable predictive performance while selecting the fewest variables overall. Finally, we compare FBEDK to feature selection with LASSO (Tibshirani, 1996) when both algorithms are limited to select the same number of variables, showing that both algorithms perform similarly. |
| Researcher Affiliation | Collaboration | Giorgos Borboudakis and Ioannis Tsamardinos, Computer Science Department, University of Crete; Gnosis Data Analysis |
| Pseudocode | Yes | Algorithm 1: Forward-Backward Selection (FBS). Input: Dataset D, Target T. Output: Selected Variables S... Algorithm 2: Forward-Backward Selection with Early Dropping (FBEDK). Input: Dataset D, Target T, Maximum Number of Runs K. Output: Selected Variables S |
| Open Source Code | No | The paper does not provide concrete access to its own source code. It mentions using 'glmnet implementation (Qian et al., 2013)' for LASSO, 'LIBSVM (Chang and Lin, 2011) implementation' for SVMs, and 'TreeBagger implementation in Matlab' for RFs, which are third-party tools. There is no statement from the authors about making their specific implementation of FBEDK or other algorithms in the paper publicly available. |
| Open Datasets | Yes | We used 12 binary classification datasets, with sample sizes ranging from 200 to 16772 and number of variables between 166 and 100000. The datasets were selected from various competitions (Guyon et al., 2004, 2006a) and the UCI repository (Dietterich et al., 1994), and were selected to cover a wide range of variable and sample sizes. A summary of the datasets is shown in Table 1. ... Table 1: Binary classification datasets used in the experimental evaluation. n is the number of samples, p is the number of predictors and P(T = 1) is the proportion of instances where T = 1. Dataset musk (v2) ... Source UCI ML Repository (Dietterich et al., 1994)... sylva ... WCCI 2006 Challenge (Guyon et al., 2006a)... madelon ... NIPS 2003 Challenge (Guyon et al., 2004) |
| Dataset Splits | Yes | For model selection and performance estimation we used a 60/20/20 stratified split of the data, using 60% as a training set, 20% as a validation set and the remaining 20% as a test set. ... For datasets with more than 1000 samples the number of repetitions was set to 10, and to 50 for the rest. |
| Hardware Specification | Yes | All experiments were performed in Matlab, running on a desktop computer with an Intel i7-7700K processor and 32GB of RAM. |
| Software Dependencies | No | The paper mentions 'Matlab' as the environment for implementation and 'glmnet implementation (Qian et al., 2013)' for LASSO, 'LIBSVM (Chang and Lin, 2011) implementation' for SVMs, and 'TreeBagger implementation in Matlab' for Random Forests. It also mentions the 'MXM R package (Lagani et al., 2017)'. However, no specific version numbers are provided for Matlab, glmnet, LIBSVM, TreeBagger, or the MXM R package, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | As selection criteria for FBEDK, FBS and MMPC we used a nested likelihood-ratio independence test based on logistic regression. For FBEDK and MMPC, the significance level α of the conditional independence test was set to {0.001, 0.005, 0.01, 0.05, 0.1}, covering a range of commonly used values, while for FBS we explored a total of 100 values, uniformly spaced in [0.001, 0.01]. For the K value of FBEDK we used {0, 1, ...}, while the maximum conditioning size max_k of MMPC was set to {1, 2, 3, 4}. For LASSO-FS we set all parameters to their default values and set the maximum number of λ values, λmax, to 100. ... As linear models we used elastic net regularized logistic regression (Zou and Hastie, 2005), using λmax = 100 and the mixture parameter α set to {0, 0.25, 0.5, 0.75, 1} ... As non-linear models we used Gaussian support vector machines (SVM) (Cortes and Vapnik, 1995) and random forests (RF) (Breiman, 2001). For SVMs ... The cost hyper-parameter C of SVMs was set to {2^-10, 2^-9, ..., 2^9} (a total of 20 values), while the remaining hyper-parameters were set to their default values. For RFs the number of trees was set to 500, the minimum leaf node size was set to {1, 5, 9} and the number of variables to split at each node was set to {0.5, 1, 1.5}·√p (9 combinations in total). |
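To make the pseudocode entries above concrete, here is a minimal sketch of the FBED^K scheme the paper describes: K+1 forward runs in which candidates judged conditionally independent of the target are dropped early, followed by a backward phase. This is not the authors' implementation; the function names (`fbed`, `ci_pvalue`) are hypothetical, and the conditional independence test is a linear Fisher-z partial-correlation test swapped in for the paper's nested likelihood-ratio test based on logistic regression, purely to keep the sketch self-contained.

```python
import math
import numpy as np

def ci_pvalue(x, y, Z):
    """Fisher-z partial-correlation test of x independent of y given Z.
    A linear stand-in for the paper's logistic likelihood-ratio test."""
    n = len(y)
    if Z.shape[1] > 0:
        # Residualize x and y on the conditioning set Z via least squares.
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = float(np.corrcoef(x, y)[0, 1])
    r = max(min(r, 0.999999), -0.999999)  # guard atanh against |r| = 1
    z = math.atanh(r) * math.sqrt(max(n - Z.shape[1] - 3, 1))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def _forward_run(X, y, selected, alpha):
    """One forward run: early-drop independent candidates, greedily add the rest."""
    n, p = X.shape
    remaining = [j for j in range(p) if j not in selected]
    added = False
    while remaining:
        Z = X[:, selected] if selected else np.empty((n, 0))
        scored = [(ci_pvalue(X[:, j], y, Z), j) for j in remaining]
        # Early dropping: candidates judged independent of y given the
        # current selection are removed from this run entirely.
        scored = [(pv, j) for pv, j in scored if pv <= alpha]
        if not scored:
            break
        _, best = min(scored)
        selected.append(best)
        added = True
        remaining = [j for _, j in scored if j != best]
    return added

def fbed(X, y, K=1, alpha=0.05):
    """Sketch of FBED^K: K+1 forward runs with early dropping, then a
    backward phase that removes variables which became independent."""
    n, _ = X.shape
    selected = []
    for _ in range(K + 1):
        if not _forward_run(X, y, selected, alpha):
            break  # nothing was added, so further runs cannot help
    # Backward phase (simplified: drop any variable found independent
    # given the rest, rather than the least significant one first).
    changed = True
    while changed:
        changed = False
        for j in list(selected):
            rest = [k for k in selected if k != j]
            Z = X[:, rest] if rest else np.empty((n, 0))
            if ci_pvalue(X[:, j], y, Z) > alpha:
                selected.remove(j)
                changed = True
                break
    return selected
```

Setting `K=0` corresponds to the single-run variant (FBED0) that the paper reports as reducing false selections on purely irrelevant data; larger K trades speed for a more thorough search.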
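The Dataset Splits row describes a 60/20/20 stratified split for model selection and performance estimation. A framework-agnostic sketch of such a split is below; the helper name `stratified_split` is hypothetical and the per-class rounding is an assumption, not the authors' exact procedure.

```python
import numpy as np

def stratified_split(y, train=0.6, val=0.2, seed=0):
    """Hypothetical helper: index-based 60/20/20 stratified split,
    preserving class proportions in each part."""
    rng = np.random.default_rng(seed)
    tr, va, te = [], [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        n_tr = int(round(train * len(idx)))
        n_va = int(round(val * len(idx)))
        tr.extend(idx[:n_tr])
        va.extend(idx[n_tr:n_tr + n_va])
        te.extend(idx[n_tr + n_va:])
    return np.array(tr), np.array(va), np.array(te)
```

Repeating this with different seeds matches the paper's protocol of 10 repetitions for datasets with more than 1000 samples and 50 repetitions otherwise.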