Scalable Feature Selection via Distributed Diversity Maximization

Authors: Sepehr Abbasi Zadeh, Mehrdad Ghadiri, Vahab Mirrokni, Morteza Zadimoghaddam

AAAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of our method by performing an extensive empirical study. In particular, we show that our distributed method outperforms state-of-the-art centralized feature selection algorithms on a variety of datasets. From a theoretical point of view, we have proved that the greedy algorithm used in our method achieves an approximation factor of 1/4 for the diversity maximization problem in a distributed setting with high probability. Furthermore, we improve this to an expected 8/25 approximation using multiplicity in our distribution. (A minimal sketch of such a greedy diversity step appears after this table.)
Researcher Affiliation | Collaboration | Sepehr Abbasi Zadeh, Mehrdad Ghadiri (Department of Computer Engineering, Sharif University of Technology; {sabbasizadeh, ghadiri}@ce.sharif.edu) and Vahab Mirrokni, Morteza Zadimoghaddam (Google Research; {mirrokni, zadim}@google.com)
Pseudocode | Yes | Algorithm 1: GREEDY; Algorithm 2: DDISMI. (A hedged sketch of the two-round distributed composition follows the table.)
Open Source Code | Yes | For the sake of reproducibility, we have provided all the code in the supplemental material.
Open Datasets | Yes | We have described these datasets in detail in the supplemental material. We considered a variety of MI-based filter methods, namely Mutual Information Quotient (MIQ) (Ding and Peng 2005), Minimum Redundancy Maximum Relevance (mRMR) (Ding and Peng 2005), Joint Mutual Information (JMI) (Yang and Moody 1999), Spectral Conditional Mutual Information (SpecCMI) (Nguyen et al. 2014), Quadratic Programming Feature Selection (QPFS) (Rodriguez-Lujan et al. 2010), and Conditional Mutual Information (CMI) (Cheng et al. 2008) as baselines, as well as non-MI-based methods, Fisher score (Duda, Hart, and Stork 2001) and ReliefF (Robnik-Šikonja and Kononenko 2003).
Dataset Splits | Yes | A 10-fold CV is used for the datasets with more than 100 instances; for the others, we employ leave-one-out CV. (See the CV sketch after the table.)
Hardware Specification | No | The paper mentions running experiments on "machines" and discusses speed-up in terms of the number of machines, but it does not specify any particular hardware components such as CPU models, GPU models, or memory specifications.
Software Dependencies | No | The paper mentions "The LIBSVM package (Chang and Lin 2011)" as an underlying implementation for SVM, but it does not provide a specific version number for LIBSVM or any other software dependencies needed for reproducibility.
Experiment Setup | Yes | The regularization factor of our algorithm (λ) is set to 0.8 in all of our experiments. We change |S| from 10 to min{100, n}, where n is the number of features in each dataset. The LIBSVM package (Chang and Lin 2011) is the underlying implementation of the linear SVM, with its regularization factor set to 1. In order to compute the probabilities used in MIs and VIs, we have discretized the continuous variables using the Minimum Description Length (MDL) method (Irani and Fayyad 1993) with 5 bins. (A sweep sketch using these settings appears below.)
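
The "Research Type" and "Pseudocode" rows refer to a greedy algorithm for diversity maximization (Algorithm 1: GREEDY). Below is a minimal Python sketch under the assumption that the objective is the remote-clique form of diversity (maximize the sum of pairwise distances among the selected features); the paper's actual GREEDY may differ in details such as initialization and tie-breaking, and its distance matrix is built from mutual information and variation of information with the λ trade-off rather than supplied directly.

```python
import numpy as np

def greedy_diversity(dist, k, rng=None):
    """Greedily pick k indices that approximately maximize the sum of
    pairwise distances among the picked points (remote-clique diversity).

    dist: symmetric (n x n) matrix of pairwise feature "distances"; in the
    paper this is derived from MI/VI with the lambda trade-off, but any
    non-negative symmetric matrix works for this sketch.
    """
    n = dist.shape[0]
    rng = np.random.default_rng(rng)
    selected = [int(rng.integers(n))]              # arbitrary first pick
    while len(selected) < k:
        # marginal gain of each candidate = sum of distances to selected points
        gains = dist[:, selected].sum(axis=1).astype(float)
        gains[selected] = -np.inf                  # never re-pick a point
        selected.append(int(np.argmax(gains)))
    return selected
```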
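Algorithm 2 (DDISMI) is the paper's distributed scheme; the sketch below only illustrates a generic two-round, composable-core-set style composition (random partition, per-machine GREEDY, then GREEDY over the union of per-machine picks) and reuses greedy_diversity from the previous sketch. The real DDISMI has further details (e.g., comparing against the best per-machine solution and the multiplicity variant behind the 8/25 bound) that are not reproduced here.

```python
import numpy as np

def ddismi_sketch(dist, k, num_machines, rng=None):
    """Two-round selection in the spirit of DDISMI (illustrative only):
    1) randomly partition the n features across machines,
    2) run GREEDY on each partition to pick k features per machine,
    3) run GREEDY once more on the union of the per-machine picks."""
    rng = np.random.default_rng(rng)
    n = dist.shape[0]
    parts = np.array_split(rng.permutation(n), num_machines)

    union = []
    for part in parts:                             # "round 1": one machine per part
        sub = dist[np.ix_(part, part)]
        local = greedy_diversity(sub, min(k, len(part)), rng)
        union.extend(int(part[i]) for i in local)

    union = np.array(union)                        # "round 2" on the collected picks
    sub = dist[np.ix_(union, union)]
    final = greedy_diversity(sub, min(k, len(union)), rng)
    return [int(union[i]) for i in final]
```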
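The "Dataset Splits" and "Experiment Setup" rows fix the evaluation protocol: 10-fold CV for datasets with more than 100 instances, leave-one-out otherwise, and a linear SVM with regularization factor 1. The sketch below mirrors that protocol, with scikit-learn's LinearSVC standing in for the LIBSVM linear SVM used in the paper.

```python
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

def evaluate_subset(X, y, feature_idx):
    """Mean CV accuracy of a linear SVM (C=1) restricted to the chosen features.
    10-fold CV when there are more than 100 instances, leave-one-out otherwise,
    as stated in the Dataset Splits row."""
    cv = KFold(n_splits=10, shuffle=True, random_state=0) if len(y) > 100 else LeaveOneOut()
    clf = LinearSVC(C=1.0)
    return cross_val_score(clf, X[:, feature_idx], y, cv=cv).mean()
```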
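Tying the rows together, a hypothetical driver for the reported sweep might look like the following: |S| is varied from 10 to min{100, n} (the step of 10 is an assumption for the sketch), the distance matrix is assumed to already encode the λ = 0.8 relevance/redundancy trade-off and the 5-bin MDL discretization, and it reuses ddismi_sketch and evaluate_subset from the sketches above. This is not the authors' released code.

```python
def run_sweep(X, y, dist, num_machines=10):
    """Sweep the number of selected features |S| and report CV accuracy.
    dist is assumed to be the lambda = 0.8 MI/VI-based distance matrix
    computed after 5-bin MDL discretization; building it is not shown here.
    num_machines=10 is an arbitrary choice for the sketch."""
    n = dist.shape[0]
    results = {}
    for k in range(10, min(100, n) + 1, 10):       # step of 10 is an assumption
        selected = ddismi_sketch(dist, k, num_machines, rng=0)
        results[k] = evaluate_subset(X, y, selected)
    return results
```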