Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Initialization of Feature Selection Search for Classification

Authors: Maria Luque-Rodriguez, Jose Molina-Baena, Alfonso Jimenez-Vilchez, Antonio Arauzo-Azofra

JAIR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental. This paper proposes the systematic application of individual feature evaluation methods to initialize search-based feature subset selection methods. An exhaustive review of the starting methods used by genetic algorithms from 2014 to 2020 has been carried out. Subsequently, an in-depth empirical study has been carried out evaluating the proposal for different search-based feature selection methods (sequential forward and backward selection, Las Vegas filter and wrapper, Simulated Annealing, and Genetic Algorithms).
Researcher Affiliation: Academia. Maria Luque-Rodriguez EMAIL, Jose Molina-Baena EMAIL, Alfonso Jimenez-Vilchez EMAIL, Dept. of Computer Science and Numerical Analysis, Universidad de Córdoba (Spain); Antonio Arauzo-Azofra EMAIL, Area of Project Engineering, Universidad de Córdoba (Spain), Campus de Rabanales, Córdoba 14014, Spain.
Pseudocode: Yes. Algorithm 1 SFS (Sequential Forward Selection):
1: S0 ← ∅; k ← 0                          ▷ Start with the empty set
2: while Sk ≠ F and J(Sk) < J* do
3:     x+ ← argmax x∉Sk J(Sk ∪ {x})       ▷ Select the next best feature
4:     Sk+1 ← Sk ∪ {x+}; k ← k + 1        ▷ Update
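The quoted pseudocode can be sketched in Python as follows. This is a generic illustration of greedy sequential forward selection, not the authors' code; the subset-evaluation function J and the feature pool are placeholders.

```python
# Generic sequential forward selection sketch (not the authors' implementation).
# J is any subset-evaluation function; `features` plays the role of the full set F.

def sfs(features, J, max_size=None):
    """Greedy SFS: repeatedly add the single feature that most improves J."""
    selected = set()
    best_score = float("-inf")
    limit = max_size if max_size is not None else len(features)
    while len(selected) < limit:
        # Evaluate adding each remaining feature to the current subset.
        candidates = [f for f in features if f not in selected]
        scored = [(J(selected | {f}), f) for f in candidates]
        score, best_f = max(scored)
        if score <= best_score:  # stop when no candidate improves J
            break
        selected.add(best_f)
        best_score = score
    return selected, best_score

# Toy example: J rewards overlap with a hidden "relevant" set and
# penalizes irrelevant features slightly.
relevant = {0, 2, 5}
J = lambda s: len(s & relevant) - 0.1 * len(s - relevant)
subset, score = sfs(range(8), J)  # -> ({0, 2, 5}, 3.0)
```

The stopping test here (no candidate improves J) stands in for the paper's criterion J(Sk) < J*.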
Open Source Code: No. The paper states: "The feature selection methods have been programmed in Python. The software used for learning methods has been Orange (Demšar et al., 2013) component-based data mining software, except for artificial neural networks, where SNNS (U. of Stuttgart, 1995) was used, integrated in Orange with Orange SNNS package." This describes the software tools used but does not provide a specific link or statement about the authors' code being made available.
Open Datasets: Yes. In order to include a wide range of classification problems, the following publicly available repositories were explored seeking representative problems with diverse properties (discrete and continuous data, different numbers of classes, features, examples, and unknown values): UCI (Newman & Merz, 1998), OPENML (Demšar et al., 2013), and a dataset called Parity3+3 generated artificially. Finally, 27 data sets were chosen. They are listed along with their main properties in Table 2.
Dataset Splits: Yes. In order to get a reliable estimate of these variables, every experiment has been performed using 10-fold stratified cross-validation. For each experiment, we have taken the mean and standard deviation of the ten folds. [...] All these comparisons use the average of ten-fold cross-validation to get a stable and confident result.
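The evaluation protocol quoted above (10-fold stratified splits, reporting mean and standard deviation over folds) can be sketched without any particular library. This is a minimal illustration of the protocol, not the authors' pipeline; `evaluate` is a placeholder for training and scoring a model on one fold.

```python
import random
from collections import defaultdict
from statistics import mean, stdev

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal class members round-robin
    return folds

def cross_validate(labels, evaluate, k=10):
    """Run evaluate(train_idx, test_idx) on each fold; report mean and std."""
    folds = stratified_folds(labels, k)
    scores = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        scores.append(evaluate(train, test))
    return mean(scores), stdev(scores)
```

With a balanced two-class dataset of 100 samples, every fold ends up with exactly 5 samples of each class, which is the point of stratification.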
Hardware Specification: Yes. Experiments have run on a cluster of 8 nodes with Intel Xeon E5420 CPU 2.50GHz processors and 2 nodes with Intel Xeon E5630 CPU 2.53GHz, under the Ubuntu 16.04 GNU/Linux operating system.
Software Dependencies: No. The paper mentions: "The feature selection methods have been programmed in Python. The software used for learning methods has been Orange (Demšar et al., 2013) component-based data mining software, except for artificial neural networks, where SNNS (U. of Stuttgart, 1995) was used, integrated in Orange with Orange SNNS package." It lists software names (Python, Orange, SNNS, Ubuntu GNU/Linux operating system) but generally lacks specific version numbers for the key libraries or frameworks beyond the OS version.
Experiment Setup: Yes. All evaluation functions are parameter-free except Relief-F. For this measure, the number of neighbours to search was set to 6, and the number of instances to sample was set to 100. Some of the learning algorithms require parameter fitting. In the case of kNN, k was set to 15 after testing that this value worked reasonably well on all data sets used. The multi-layer perceptron used has one layer trained during 250 cycles with a propagation value of 0.1. For SVM, the Orange.SVMLearner Easy method was used to fit parameters to each case automatically. [...] Its main parameter is the number of limit assessments, which in both has been set to 1000 feature sets. LVF needs to limit the reduction allowed for the relevance of the features, and it has been set at 1%. In metaheuristic methods (SA and GA), the first restriction used in their parameters is that they perform 1000 evaluations.
In SA it is necessary to set:
- Initial temperature: T0 = (−v / ln(φ)) · J(S0), which allows a probability φ of accepting a solution that is v per one worse than the initial solution S0. We take v = 0.3 and φ = 0.3.
- Generation of neighbours: new neighbours are generated by adding or eliminating features of the current set; 20 neighbours are generated in each cooling, that is, for each temperature value.
- Cooling scheme: the Cauchy scheme (T = T0 / (1 + i)); 50 coolings are carried out.
In GA it is necessary to set:
- A generational type, with a population of 40 individuals and 50 generations.
- Simple one-point crossover on the binary representation of the set of selected features; crossover probability 0.6.
- Mutation adds or removes one feature; mutation probability 0.001.
For the initialization method, to calculate the probability of preselecting features in the GA, preliminary experiments found that the following parameters obtained good results: Pdesired = 0.5 and Pmax = 0.7.
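The SA temperature rules quoted above can be written out numerically. The excerpt's typeset formula is garbled, so the initial-temperature rule is read here as T0 = (−v / ln φ) · J(S0), the sign chosen so that T0 > 0 and so that a solution worse by a fraction v of J(S0) is accepted with probability exactly φ; this reading, and the helper names, are assumptions, not the authors' code.

```python
import math

# Sketch of the reported SA temperature schedule (an interpretation of the
# excerpt, not the authors' implementation). v = 0.3 and phi = 0.3 as quoted.
v, phi = 0.3, 0.3

def initial_temperature(J0, v=v, phi=phi):
    # Chosen so a solution worse than J0 by a fraction v is accepted with
    # probability phi: phi = exp(-(v * J0) / T0)  =>  T0 = -v * J0 / ln(phi).
    return -v * J0 / math.log(phi)

def cauchy_temperature(T0, i):
    # Cauchy cooling scheme quoted in the paper: T = T0 / (1 + i).
    return T0 / (1 + i)

T0 = initial_temperature(1.0)
# Sanity check of the acceptance-probability property at T0:
accept = math.exp(-(v * 1.0) / T0)  # recovers phi = 0.3
```

The schedule then runs i = 0 .. 49 (the 50 coolings), with 20 neighbours evaluated at each temperature, matching the quoted 1000-evaluation budget.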