Median Selection Subset Aggregation for Parallel Inference

Authors: Xiangyu Wang, Peichao Peng, David B Dunson

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.
Researcher Affiliation | Academia | Xiangyu Wang, Dept. of Statistical Science, Duke University (xw56@stat.duke.edu); Peichao Peng, Statistics Department, University of Pennsylvania (ppeichao@yahoo.com); David B. Dunson, Dept. of Statistical Science, Duke University (dunson@stat.duke.edu)
Pseudocode | Yes | Algorithm 1: Message algorithm
Open Source Code | No | The paper states that 'The Lasso part of all algorithms will be implemented by the glmnet package [24]', which refers to a third-party package rather than code for the authors' own method. No other statement about source-code availability for the 'message' algorithm is provided.
Open Datasets | Yes | This section assesses the performance of the message algorithm via extensive examples, comparing the results to:
- Full data inference (denoted as full data).
- Subset averaging: partition and average the estimates obtained on all subsets (denoted as averaging).
- Subset median: partition and take the marginal median of the estimates obtained on all subsets (denoted as median).
- Bolasso: run Lasso on multiple bootstrap samples and intersect to select the model, then estimate the coefficients based on the selected model (denoted as Bolasso).

The Lasso part of all algorithms will be implemented by the glmnet package [24]. (We did not use ADMM [25] for Lasso as its actual performance might suffer from certain drawbacks [6] and it is reported to be slower than glmnet [26].)

4.1 Synthetic data sets

We use the linear model and the logistic model for (p, s) = (1,000, 3) or (10,000, 3) with different sample sizes n and different partition numbers m to evaluate the performance. The feature vector is drawn from a multivariate normal distribution with correlation ρ = 0 or 0.5. Coefficients β are chosen as

β_i ∼ (−1)^{Ber(0.4)} (8 log n/√n + |N(0, 1)|),  i ∈ S.

Since GIC is intractable to implement (NP-hard), we combine it with Lasso for variable selection: implement Lasso for a set of different λ's and determine the optimal one via GIC. The concrete setup of the models is as follows:

Case 1: Linear model with ε ∼ N(0, 2²).
Case 2: Linear model with ε ∼ t(0, df = 3).
Case 3: Logistic model.

For p = 1,000, we simulate 200 data sets for each case, and vary the sample size from 2,000 to 10,000. For each case, the subset size is fixed to 400, so the number of subsets changes from 5 to 25. In the experiment, we record the mean square error of β̂, the probability of selecting the true model and the computational time, and plot them in Fig 1-6. For p = 10,000, we simulate 50 data sets for each case, and let the sample size range from 20,000 to 50,000 with the subset size fixed to 2,000. Results for p = 10,000 are provided in the supplementary materials.

It is clear that message had excellent performance in all of the simulation cases, with low MSE, high probability of selecting the true model, and low computational time. The other subset-based methods we considered had similar computational times and also had computational burdens that effectively did not increase with sample size, while the full data analysis and the bootstrap Lasso approach were both substantially slower than the subset methods, with the gap increasing linearly in sample size. In terms of MSE, the averaging and median approaches both had dramatically worse performance than message in every case, while bootstrap Lasso was competitive (MSEs were of the same order of magnitude, with message ranging from effectively identical to having a small but significant advantage), with both message and bootstrap Lasso clearly outperforming the full data approach. In terms of feature selection performance, averaging had by far the worst performance, followed by the full data approach, which was substantially worse than bootstrap Lasso, median and message, with no clear winner among these three methods. Overall, message clearly had by far the best combination of low MSE, accurate model selection and fast computation.

4.2 Individual household electric power consumption

This data set contains measurements of electric power consumption for every household with a one-minute sampling rate [27]. The data have been collected over a period of almost 4 years and contain 2,075,259 measurements.
There are 8 predictors, which are converted to 74 predictors due to re-coding of the categorical variables (date and time). We use the first 2,000,000 samples as the training set and the remaining 75,259 for testing the prediction accuracy. The data are partitioned into 200 subsets for parallel inference. We plot the prediction accuracy (mean square error on the test samples) against time for the full data, message, averaging and median methods in Fig 7. Bolasso is excluded as it did not produce meaningful results within the time span. To illustrate details of the performance, we split the timeline into two parts: the early stage shows how all algorithms adapt to a low prediction error, and a later stage captures the more subtle performance of the faster algorithms (full-data inference excluded due to the scale). It can be seen that message dominates the other algorithms in both speed and accuracy.

4.3 HIGGS classification

The HIGGS data have been produced using Monte Carlo simulations from a particle physics model [28]. They contain 27 predictors that are of interest to physicists wanting to distinguish between two classes of particles. The sample size is 11,000,000. We use the first 10,000,000 samples for training a logistic model and the rest to test the classification accuracy. The training set is partitioned into 1,000 subsets for parallel inference. The classification accuracy (probability of correctly predicting the class of test samples) against computational time is plotted in Fig 8 (Bolasso excluded for the same reason as above).

[Figure 7: Results for power consumption data. Two panels of mean prediction error against time: an earlier stage comparing message, median, average and full data, and a later stage comparing message, median and average.]

[Figure 8: Results for HIGGS classification. Mean prediction accuracy against time for message, median, average and full data.]

Message adapts to the prediction bound quickly. Although the classification results are not as good as the benchmarks listed in [28] (due to the choice of a simple parametric logistic model), our new algorithm achieves the best performance subject to the constraints of the model class.

5 Discussion and conclusion

In this paper, we proposed a flexible and efficient message algorithm for regression and classification with feature selection. Message essentially eliminates the computational burden attributable to communication among machines, and is as efficient as other simple subset aggregation methods. By selecting the median model, message can achieve better accuracy even than feature selection on the full data, resulting in an improvement also in MSE performance. Extensive simulation experiments show outstanding performance relative to competitors in terms of computation, feature selection and prediction.

Although the theory described in Section 3 is mainly concerned with linear models, the algorithm is applicable in fairly wide situations. The geometric median is a topological concept, which allows the median model to be obtained in any normed model space. The properties of the median model result from independence of the subsets and weak consistency on each subset. Once these two conditions are satisfied, the property shown in Section 3 can be transferred to essentially any model space. The follow-up averaging step has been proven to be consistent for all M-estimators with a proper choice of the partition number [8]. [An illustrative sketch of this selection-then-averaging scheme appears after the reference list below.]
References

[1] Gonzalo Mateos, Juan Andrés Bazerque, and Georgios B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262-5276, 2010.
[2] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.
[3] Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, and David Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, pages 28-36, 2012.
[4] Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703-710, 2010.
[5] Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134-2142, 2009.
[6] Zhimin Peng, Ming Yan, and Wotao Yin. Parallel and distributed sparse optimization. Preprint, 2013.
[7] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165-202, 2012.
[8] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. In NIPS, volume 4, pages 5-1, 2012.
[9] Gideon Mann, Ryan T. McDonald, Mehryar Mohri, Nathan Silberman, and Dan Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, volume 22, pages 1231-1239, 2009.
[10] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16, 2013.
[11] Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780, 2013.
[12] Xiangyu Wang and David B. Dunson. Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.
[13] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. arXiv preprint arXiv:1308.1334, 2013.
[14] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660, 2014.
[15] Angela M. Wood, Ian R. White, and Patrick Royston. How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17):3227-3246, 2008.
[16] Francis R. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33-40. ACM, 2008.
[17] Dean P. Foster and Edward I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947-1975, 1994.
[18] Sadanori Konishi and Genshiro Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4):875-890, 1996.
[19] Jiahua Chen and Zehua Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771, 2008.
[20] Yongdai Kim, Sunghoon Kwon, and Hosik Choi. Consistent model selection criteria on high dimensions. The Journal of Machine Learning Research, 13(1):1037-1057, 2012.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[22] Peng Zhao and Bin Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.
[23] Maria Maddalena Barbieri and James O. Berger. Optimal predictive model selection. The Annals of Statistics, pages 870-897, 2004.
[24] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[25] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[26] Xingguo Li, Tuo Zhao, Xiaoming Yuan, and Han Liu. An R package flare for high dimensional linear regression and precision matrix estimation, 2013.
[27] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.
[28] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Deep learning in high-energy physics: Improving the search for exotic particles. arXiv preprint arXiv:1402.4735, 2014.
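The excerpt quoted in this row describes message as per-subset variable selection (Lasso tuned by GIC), a median/majority-vote model across subsets, and a follow-up averaging of the subset estimates. The Python sketch below illustrates that style of aggregation; it is not the authors' implementation (no code was released), it swaps the paper's glmnet-plus-GIC tuning for scikit-learn's LassoCV, and the names lasso_support and message_aggregate are hypothetical.

```python
# Illustrative sketch of median-selection subset aggregation, NOT the authors' code.
# Assumption: per-subset selection uses scikit-learn's LassoCV in place of the
# paper's Lasso path tuned by GIC via glmnet.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression


def lasso_support(X, y):
    """Variable selection on one subset: indices with nonzero Lasso coefficients."""
    coef = LassoCV(cv=5).fit(X, y).coef_
    return np.flatnonzero(np.abs(coef) > 1e-8)


def message_aggregate(X, y, m, seed=0):
    """Partition rows into m subsets, majority-vote the selected variables
    (the marginal median model), then refit OLS on each subset and average."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n), m)
    votes = np.zeros(p)
    for idx in parts:
        votes[lasso_support(X[idx], y[idx])] += 1
    model = np.flatnonzero(votes >= m / 2.0)  # variables chosen on at least half the subsets
    beta = np.zeros(p)
    if model.size:
        subset_fits = [LinearRegression().fit(X[idx][:, model], y[idx]).coef_ for idx in parts]
        beta[model] = np.mean(subset_fits, axis=0)
    return model, beta
```

On the synthetic settings quoted above (subset size 400 for p = 1,000), m would simply be n divided by 400; the two-stage structure (communication-free selection, then averaging) is what the paper credits for the low computational cost.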
Dataset Splits | No | The paper does not explicitly mention a validation set or a three-way split of the data (train/validation/test).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper states: 'The Lasso part of all algorithms will be implemented by the glmnet package [24].' However, it does not specify a version number for glmnet or any other software dependencies.
Experiment Setup | Yes | For p = 1,000, we simulate 200 data sets for each case, and vary the sample size from 2,000 to 10,000. For each case, the subset size is fixed to 400, so the number of subsets changes from 5 to 25. [...] For p = 10,000, we simulate 50 data sets for each case, and let the sample size range from 20,000 to 50,000 with the subset size fixed to 2,000. [...] The feature vector is drawn from a multivariate normal distribution with correlation ρ = 0 or 0.5. Coefficients β are chosen as β_i ∼ (−1)^{Ber(0.4)} (8 log n/√n + |N(0, 1)|), i ∈ S. [...] Case 1: Linear model with ε ∼ N(0, 2²). Case 2: Linear model with ε ∼ t(0, df = 3). Case 3: Logistic model. [...] The data are partitioned into 200 subsets for parallel inference.
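As a concrete reading of the synthetic setup quoted in this row, the sketch below generates one Case 1 data set (linear model, ε ∼ N(0, 2²)) with p = 1,000 and s = 3. It is an illustration under stated assumptions, not the authors' simulation code: the placement of the s signal coordinates and the equicorrelated covariance structure are choices the quoted text does not pin down, and simulate_case1 is a hypothetical name.

```python
# Hypothetical sketch of the Case 1 synthetic data described above, not the
# authors' simulation code. Which s coordinates form the true support S, and
# the equicorrelated covariance structure, are assumptions.
import numpy as np


def simulate_case1(n, p=1000, s=3, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Features: N(0, 1) with pairwise correlation rho, built from a common factor.
    z = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * z + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))
    # beta_i ~ (-1)^{Ber(0.4)} * (8 log n / sqrt(n) + |N(0, 1)|) for i in S.
    beta = np.zeros(p)
    signs = (-1.0) ** rng.binomial(1, 0.4, size=s)
    beta[:s] = signs * (8 * np.log(n) / np.sqrt(n) + np.abs(rng.standard_normal(s)))
    # Case 1: linear model with epsilon ~ N(0, 2^2).
    y = X @ beta + rng.normal(0.0, 2.0, size=n)
    return X, y, beta
```

For the quoted experiments one would draw such a data set for each sample size n from 2,000 to 10,000 and split the rows into subsets of size 400 before running the subset aggregation.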