Privacy-Preserving Feature Selection with Secure Multiparty Computation
Authors: Xiling Li, Rafael Dowsley, Martine De Cock
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the feasibility of our approach for practical data science, we perform experiments with the proposed MPC protocols for feature selection in a commonly used machine-learning-as-a-service configuration where computations are outsourced to multiple servers, with semi-honest and with malicious adversaries. Regarding effectiveness, we show that secure feature selection with the proposed protocols improves the accuracy of classifiers on a variety of real-world data sets, without leaking information about the feature values or even which features were selected. Regarding efficiency, we document runtimes ranging from several seconds to an hour for our protocols to finish, depending on the size of the data set and the security settings. |
| Researcher Affiliation | Academia | 1School of Engineering and Technology, University of Washington, Tacoma, Washington, USA 2Faculty of Information Technology, Monash University, Clayton, Australia 3Department of Appl. Math., Computer Science and Statistics, Ghent University, Ghent, Belgium. |
| Pseudocode | Yes | Protocol 1: πFILTER-FS for Secure Filter-based Feature Selection; Protocol 2: πMS-GINI for Secure MS-GINI Score of a Feature; Protocol 3: πGINI-FS for Secure Filter-based Feature Selection with MS-GINI |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code specific to the methodology described in this paper. |
| Open Datasets | Yes | https://www.ubittention.org/2020/data/Cognitive-load%20challenge%20description.pdf, https://archive.ics.uci.edu/ml/datasets/LSVT+Voice+Rehabilitation, https://www.openml.org/d/40536 |
| Dataset Splits | Yes | The first 4 columns of Table 1 contain details for three data sets corresponding to binary classification tasks with continuous valued input features: Cognitive Load Detection (Cog Load) (Gjoreski et al., 2020), Lee Silverman Voice Treatment (LSVT) (Tsanas et al., 2014), and Speed Dating (SPEED) (Fisman et al., 2006), along with the number of instances m, raw features p, selected features k, and folds for cross-validation (CV). We used grid search to select an appropriate value of k for each data set and retained the value of k with the best accuracy. |
| Hardware Specification | Yes | All benchmark tests were completed on 3 or 4 co-located F32s V2 Azure virtual machines. Each VM contains 32 cores, 64 GiB of memory, and up to 14 Gbps of network bandwidth between virtual machines. |
| Software Dependencies | Yes | To obtain these results, we implemented πGINI-FS along with the supporting protocols πMS-GINI and πFILTER-FS in MP-SPDZ (Keller, 2020). |
| Experiment Setup | Yes | The remaining columns of Table 1 contain accuracy results, averaged over CV folds, for logistic regression (LR) models trained on the RAW data sets with all p features, and on reduced data sets with only the top k features selected with a variety of scoring techniques... We used grid search to select an appropriate value of k for each data set and retained the value of k with the best accuracy. The runtime results are for semi-honest (passive) and malicious (active) adversary models (see Sec. 2.2) in a 3PC or 4PC honest-majority setting over a ring Zq with q = 2^64. |
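The filter-based selection that the secure protocols πFILTER-FS and πMS-GINI implement can be illustrated in the clear. The sketch below is a plaintext, non-private Python analogue only: it assumes MS-GINI scores a continuous feature by thresholding it at its mean and taking the weighted Gini impurity of the resulting split (an assumption about the split rule; the paper's actual protocols run this computation under MPC in MP-SPDZ, never revealing feature values or which features were selected). All function names here are illustrative, not from the paper.

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity of a label vector: 1 - sum of squared class proportions."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def mean_split_gini(x, y):
    """Score one continuous feature: split samples at the feature's mean and
    return the size-weighted Gini impurity of the two sides (lower = purer)."""
    mask = x <= x.mean()
    left, right = y[mask], y[~mask]
    n = len(y)
    return (len(left) / n) * gini_impurity(left) + \
           (len(right) / n) * gini_impurity(right)

def filter_feature_selection(X, y, k):
    """Filter method: score every feature independently, keep the k
    lowest-impurity (most class-separating) columns."""
    scores = np.array([mean_split_gini(X[:, j], y) for j in range(X.shape[1])])
    return np.sort(np.argsort(scores)[:k])

# tiny demo: feature 0 separates the two classes, feature 1 is noise
X = np.array([[0.0, 5.0], [0.1, 1.0], [0.9, 4.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(filter_feature_selection(X, y, k=1))  # -> [0]
```

In the paper's setting, the comparisons, divisions, and the argsort-style top-k selection above are replaced by secret-shared MPC subprotocols over the ring Z_{2^64}, so that no server learns the scores or the selected indices.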