Revisiting Probability Distribution Assumptions for Information Theoretic Feature Selection

Authors: Yuan Sun, Wei Wang, Michael Kirley, Xiaodong Li, Jeffrey Chan

AAAI 2020, pp. 5908–5915 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct detailed empirical studies across a suite of 29 real-world classification problems and illustrate improved prediction accuracy of our methods based on the identification of more informative features, thus providing support for our theoretical findings.
Researcher Affiliation | Academia | RMIT University, Melbourne, Australia; University of Melbourne, Parkville, Australia
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The Python and C++ source code of these methods is available online at https://github.com/yuansuny/pda.
Open Datasets | Yes | These methods are then used to select a subset of features for 29 real-world classification tasks from the UCI Machine Learning Repository (Lichman 2013). (A dataset-loading sketch follows this table.)
Dataset Splits | Yes | We calculate the average 10-fold cross-validation error rate on the range of 10 to 100 features (or 10 to M if the number of features M < 100) as an indication of the effectiveness of feature selection methods, following (Nguyen et al. 2014; Gao, Ver Steeg, and Galstyan 2016). (An evaluation sketch follows this table.)
Hardware Specification | No | The paper does not specify the hardware used to run its experiments (no CPU/GPU models, clock speeds, or memory amounts).
Software Dependencies | No | The paper mentions Python and C++ source code but gives no version numbers for the libraries, frameworks, or compilers used, which are needed to pin down reproducible software dependencies.
Experiment Setup | Yes | The features selected by each method are evaluated using two classifiers: K-Nearest Neighbour (KNN) with K = 3 and a linear Support Vector Machine (SVM) with the regularization parameter set to 1. [...] For datasets with continuous features, the Minimum Description Length method (Fayyad and Irani 1993) is employed to evenly divide the continuous values into five bins, following (Vinh et al. 2016). Note that the discretization is only used in the feature selection procedure, while the classifiers still use the original continuous values. (A setup sketch follows this table.)
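
The 29 UCI tasks are not bundled with the repository, so reproducing the setup starts with fetching the data. A minimal loading sketch, assuming scikit-learn: the Wine data shipped with scikit-learn originates from the UCI repository, but whether it is among the paper's 29 tasks is an assumption made here purely for illustration.

```python
# Minimal sketch: load one UCI classification task via scikit-learn.
# Wine is a UCI dataset bundled with scikit-learn; it stands in for
# any of the paper's 29 tasks (which ones they are is not assumed here).
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)  # (178, 13) continuous features, 3-class labels
```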
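The Dataset Splits row pins down the evaluation protocol: average the 10-fold cross-validation error over feature subsets of size 10 up to 100 (or up to M when a dataset has M < 100 features). A minimal sketch, assuming scikit-learn; `ranked` is a hypothetical feature ranking (best feature first) standing in for the output of any of the paper's selectors.

```python
# Minimal sketch of the averaged 10-fold CV error described above.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def average_cv_error(X, y, ranked, max_k=100, cv=10):
    """Mean 10-fold CV error over the top-10 .. top-max_k ranked features."""
    max_k = min(max_k, X.shape[1])  # use 10..M when M < 100
    errors = []
    for k in range(10, max_k + 1):
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                              X[:, ranked[:k]], y, cv=cv).mean()
        errors.append(1.0 - acc)  # error rate = 1 - accuracy
    return float(np.mean(errors))

# Usage with the data loaded above and a placeholder identity ranking:
# score = average_cv_error(X, y, ranked=np.arange(X.shape[1]))
```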
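The Experiment Setup row fixes the classifiers and the discretization step. A minimal sketch, again assuming scikit-learn; KBinsDiscretizer with a uniform strategy is a simple stand-in for the MDL discretization (Fayyad and Irani 1993) named in the paper, not a reimplementation of it. As the paper notes, only the feature-selection step sees the discretized values; the classifiers are trained on the original continuous features.

```python
# Minimal sketch of the classifier and discretization configuration.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC

knn = KNeighborsClassifier(n_neighbors=3)  # KNN with K = 3
svm = SVC(kernel="linear", C=1.0)          # linear SVM, regularization C = 1

# Five-bin discretization, used only inside the feature-selection
# procedure; knn/svm are still fit on the original continuous values.
# strategy="uniform" (equal-width bins) is an assumption standing in
# for the paper's MDL-based discretization.
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_discrete = discretizer.fit_transform(X)  # X: continuous matrix from above
```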