Revisiting Probability Distribution Assumptions for Information Theoretic Feature Selection

Authors: Yuan Sun, Wei Wang, Michael Kirley, Xiaodong Li, Jeffrey Chan

AAAI 2020, pp. 5908–5915 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct detailed empirical studies across a suite of 29 real-world classification problems and illustrate improved prediction accuracy of our methods based on the identification of more informative features, thus providing support for our theoretical findings.
Researcher Affiliation | Academia | RMIT University, Melbourne, Australia; University of Melbourne, Parkville, Australia
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The Python and C++ source code of these methods is available online at https://github.com/yuansuny/pda.
Open Datasets | Yes | These methods are then used to select a subset of features for 29 real-world classification tasks from the UCI Machine Learning Repository (Lichman 2013). (A dataset-loading sketch follows this table.)
Dataset Splits | Yes | We calculate the average 10-fold cross-validation error rate on the range of 10 to 100 features (or 10 to M if the number of features M < 100) as an indication of the effectiveness of feature selection methods, following (Nguyen et al. 2014; Gao, Ver Steeg, and Galstyan 2016). (An evaluation sketch follows this table.)
Hardware Specification | No | The paper does not specify the hardware used to run its experiments (no CPU/GPU models, clock speeds, or memory amounts).
Software Dependencies | No | The paper mentions Python and C++ source code but gives no version numbers for the libraries, frameworks, or compilers used, which are needed to pin down reproducible software dependencies.
Experiment Setup | Yes | The features selected by each method are evaluated using two classifiers: K-Nearest Neighbour (KNN) with K = 3 and a linear Support Vector Machine (SVM) with the regularization parameter set to 1. [...] For datasets with continuous features, the Minimum Description Length method (Fayyad and Irani 1993) is employed to evenly divide the continuous values into five bins, following (Vinh et al. 2016). Note that the discretization is only used in the feature selection procedure, while the classifiers still use the original continuous values. (A setup sketch follows this table.)
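
The 29 UCI tasks are not bundled with the repository, so reproducing the setup starts with fetching the data. A minimal loading sketch, assuming scikit-learn: the Wine data shipped with scikit-learn originates from the UCI repository, but whether it is among the paper's 29 tasks is an assumption made here purely for illustration.

```python
# Minimal sketch: load one UCI classification task via scikit-learn.
# Wine is a UCI dataset bundled with scikit-learn; it stands in for
# any of the paper's 29 tasks (which ones they are is not assumed here).
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # (178, 13) continuous features, 3-class labels
```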
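The Dataset Splits row pins down the evaluation protocol: average the 10-fold cross-validation error over feature subsets of size 10 up to 100 (or up to M when a dataset has M < 100 features). A minimal sketch, assuming scikit-learn; `ranked` is a hypothetical feature ranking (best feature first) standing in for the output of any of the paper's selectors.

```python
# Minimal sketch of the averaged 10-fold CV error described above.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def average_cv_error(X, y, ranked, max_k=100, cv=10):
    """Mean 10-fold CV error over the top-10 .. top-max_k ranked features."""
    max_k = min(max_k, X.shape[1])  # use 10..M when M < 100
    errors = []
    for k in range(10, max_k + 1):
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                              X[:, ranked[:k]], y, cv=cv).mean()
        errors.append(1.0 - acc)  # error rate = 1 - accuracy
    return float(np.mean(errors))

# Usage with the data loaded above and a placeholder identity ranking:
# score = average_cv_error(X, y, ranked=np.arange(X.shape[1]))
```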
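The Experiment Setup row fixes the classifiers and the discretization step. A minimal sketch, again assuming scikit-learn; KBinsDiscretizer with a uniform strategy is a simple stand-in for the MDL discretization (Fayyad and Irani 1993) named in the paper, not a reimplementation of it. As the paper notes, only the feature-selection step sees the discretized values; the classifiers are trained on the original continuous features.

```python
# Minimal sketch of the classifier and discretization configuration.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC

knn = KNeighborsClassifier(n_neighbors=3)  # KNN with K = 3
svm = SVC(kernel="linear", C=1.0)          # linear SVM, regularization C = 1

# Five-bin discretization, used only inside the feature-selection
# procedure; knn/svm are still fit on the original continuous values.
# strategy="uniform" (equal-width bins) is an assumption standing in
# for the paper's MDL-based discretization.
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_discrete = discretizer.fit_transform(X)  # X: continuous matrix from above
```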