Parallel Feature Selection Inspired by Group Testing

Authors: Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Xuanlong Nguyen, Christopher Ré, Venu Govindaraju

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that has both redundant features and sample size in the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a pre-processing step for most other existing methods. ... 3 Experimental results ... 3.1 Synthetic experiments ... 3.2 Real-world data experiment results (a hedged sketch of this test-based scheme follows the table)
Researcher Affiliation | Academia | Yingbo Zhou, Utkarsh Porwal (CSE Department, SUNY at Buffalo; {yingbozh, utkarshp}@buffalo.edu); Ce Zhang (CS Department, University of Wisconsin-Madison; czhang@cs.wisc.edu); Hung Ngo (CSE Department, SUNY at Buffalo; hungngo@buffalo.edu); XuanLong Nguyen (EECS Department, University of Michigan; xuanlong@umich.edu); Christopher Ré (CS Department, Stanford University; chrismre@cs.stanford.edu); Venu Govindaraju (CSE Department, SUNY at Buffalo; govind@buffalo.edu)
Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing the source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | Large: TAC-KBP is a large data set with the number of samples and dimensions in the millions [3]; its domain is relation extraction from natural language text. Medium: GISETTE and MADELON are the two largest data sets from the NIPS 2003 feature selection challenge [4], with the number of dimensions in the thousands. Small: Colon, Leukemia, Lymph, NCI9, and Lung are chosen from the small Micro-array datasets [6], along with the UCI datasets [5]. Footnotes: [3] http://nlp.cs.qc.cuny.edu/kbp/2010/ [4] http://www.nipsfsc.ecs.soton.ac.uk/datasets/ [5] http://archive.ics.uci.edu/ml/
Dataset Splits | Yes | To get the test results, we use the features according to the smallest validation error for each method, and the results on the test set are illustrated in table 4. (see the protocol sketch after the table)
Hardware Specification | Yes | We also try different degrees of parallelism by running our approach using a single thread, 4 threads on a 4-core CPU, 32 threads on a single 8-CPU (4 cores/CPU) machine, and multiple machines available in the national Open Science Grid (OSG).
Software Dependencies | No | The paper mentions various algorithms and classifiers (e.g., MIM, FGM, k-nearest neighbor classifier, linear support vector machine, logistic regression) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The final classification is done using a k-nearest neighbor classifier with k fixed to three and Euclidean distance. ... We bin each dimension of the data into five equally spaced bins when the data is real valued; otherwise the data is not processed. (see the setup sketch after the table)
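As a reading aid, the test-based scheme the abstract describes can be sketched in a few lines: each randomized test scores a random subset of features, and a feature's relevance is aggregated over the tests it participates in. The paper releases no code, so everything below is an illustration only; the scoring function (3-fold cross-validated accuracy of a logistic regression), the number of tests, and the subset size are all assumptions, not the authors' choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def group_testing_select(X, y, n_tests=200, subset_size=None, n_keep=20, seed=0):
    """Group-testing-style selection sketch: run independent randomized
    tests, each scoring a random feature subset; average per-feature scores."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    k = subset_size or max(1, d // 10)        # assumed subset size per test
    totals, counts = np.zeros(d), np.zeros(d)
    for _ in range(n_tests):                  # tests are independent, hence parallelizable
        subset = rng.choice(d, size=k, replace=False)
        # Stand-in scoring function: CV accuracy of a simple classifier
        # trained on just this feature subset.
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, subset], y, cv=3).mean()
        totals[subset] += score
        counts[subset] += 1
    mean_score = totals / np.maximum(counts, 1)
    return np.argsort(mean_score)[::-1][:n_keep]
```

Because the tests never communicate, they can be farmed out to threads or machines independently, which is the degree-of-parallelism experiment the Hardware Specification row quotes.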
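The model-selection step in the Dataset Splits row amounts to an argmin over validation errors. A hypothetical sketch with placeholder numbers, not values from the paper:

```python
# Validation error per candidate feature count (placeholder values).
val_err = {10: 0.21, 50: 0.14, 100: 0.17}
best_k = min(val_err, key=val_err.get)   # smallest validation error wins
print(f"report test accuracy using the top-{best_k} features")
```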
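The Experiment Setup row maps onto standard scikit-learn components. A minimal sketch, assuming the paper's "five equally spaced bins" mean equal-width discretization and substituting synthetic data for the paper's datasets:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for a real-valued dataset.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Real-valued data: bin each dimension into five equal-width bins.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_tr_b = binner.fit_transform(X_tr)
X_te_b = binner.transform(X_te)

# Final classification: k-nearest neighbors with k = 3, Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_tr_b, y_tr)
print("test accuracy:", knn.score(X_te_b, y_te))
```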