Parallel Feature Selection Inspired by Group Testing

Authors: Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Xuanlong Nguyen, Christopher Ré, Venu Govindaraju

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that has both redundant features and sample size in the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a pre-processing step for most other existing methods. ... 3 Experimental results ... 3.1 Synthetic experiments ... 3.2 Real-world data experiment results (a hedged sketch of this test-based scheme follows the table)
Researcher Affiliation | Academia | Yingbo Zhou, Utkarsh Porwal (CSE Department, SUNY at Buffalo; {yingbozh, utkarshp}@buffalo.edu); Ce Zhang (CS Department, University of Wisconsin-Madison; czhang@cs.wisc.edu); Hung Ngo (CSE Department, SUNY at Buffalo; hungngo@buffalo.edu); XuanLong Nguyen (EECS Department, University of Michigan; xuanlong@umich.edu); Christopher Ré (CS Department, Stanford University; chrismre@cs.stanford.edu); Venu Govindaraju (CSE Department, SUNY at Buffalo; govind@buffalo.edu)
Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing the source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | Large: TAC-KBP is a large data set with the number of samples and dimensions in the millions [3]; its domain is relation extraction from natural language text. Medium: GISETTE and MADELON are the two largest data sets from the NIPS 2003 feature selection challenge [4], with the number of dimensions in the thousands. Small: Colon, Leukemia, Lymph, NCI9, and Lung are chosen from the small Micro-array datasets [6], along with the UCI datasets [5]. Footnotes: [3] http://nlp.cs.qc.cuny.edu/kbp/2010/ [4] http://www.nipsfsc.ecs.soton.ac.uk/datasets/ [5] http://archive.ics.uci.edu/ml/
Dataset Splits | Yes | To get the test results, we use the features according to the smallest validation error for each method, and the results on the test set are illustrated in table 4. (see the protocol sketch after the table)
Hardware Specification | Yes | We also try different degrees of parallelism by running our approach using a single thread, 4 threads on a 4-core CPU, 32 threads on a single 8-CPU (4 cores/CPU) machine, and multiple machines available in the national Open Science Grid (OSG).
Software Dependencies | No | The paper mentions various algorithms and classifiers (e.g., MIM, FGM, k-nearest neighbor classifier, linear support vector machine, logistic regression) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The final classification is done using a k-nearest neighbor classifier with k fixed to three and Euclidean distance. ... We bin each dimension of the data into five equally spaced bins when the data is real valued; otherwise the data is not processed. (see the setup sketch after the table)
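As a reading aid, the test-based scheme the abstract describes can be sketched in a few lines: each randomized test scores a random subset of features, and a feature's relevance is aggregated over the tests it participates in. The paper releases no code, so everything below is an illustration only; the scoring function (3-fold cross-validated accuracy of a logistic regression), the number of tests, and the subset size are all assumptions, not the authors' choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def group_testing_select(X, y, n_tests=200, subset_size=None, n_keep=20, seed=0):
    """Group-testing-style selection sketch: run independent randomized
    tests, each scoring a random feature subset; average per-feature scores."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    k = subset_size or max(1, d // 10)        # assumed subset size per test
    totals, counts = np.zeros(d), np.zeros(d)
    for _ in range(n_tests):                  # tests are independent, hence parallelizable
        subset = rng.choice(d, size=k, replace=False)
        # Stand-in scoring function: CV accuracy of a simple classifier
        # trained on just this feature subset.
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, subset], y, cv=3).mean()
        totals[subset] += score
        counts[subset] += 1
    mean_score = totals / np.maximum(counts, 1)
    return np.argsort(mean_score)[::-1][:n_keep]
```

Because the tests never communicate, they can be farmed out to threads or machines independently, which is the degree-of-parallelism experiment the Hardware Specification row quotes.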
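The model-selection step in the Dataset Splits row amounts to an argmin over validation errors. A hypothetical sketch with placeholder numbers, not values from the paper:

```python
# Validation error per candidate feature count (placeholder values).
val_err = {10: 0.21, 50: 0.14, 100: 0.17}
best_k = min(val_err, key=val_err.get)   # smallest validation error wins
print(f"report test accuracy using the top-{best_k} features")
```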
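The Experiment Setup row maps onto standard scikit-learn components. A minimal sketch, assuming the paper's "five equally spaced bins" mean equal-width discretization and substituting synthetic data for the paper's datasets:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for a real-valued dataset.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Real-valued data: bin each dimension into five equal-width bins.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_tr_b = binner.fit_transform(X_tr)
X_te_b = binner.transform(X_te)

# Final classification: k-nearest neighbors with k = 3, Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_tr_b, y_tr)
print("test accuracy:", knn.score(X_te_b, y_te))
```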