Coresets for Scalable Bayesian Logistic Regression

Authors: Jonathan Huggins, Trevor Campbell, Tamara Broderick

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated the performance of the logistic regression coreset algorithm on a number of synthetic and real-world datasets. Experiments on a variety of synthetic and real-world datasets validate our approach and demonstrate robustness to the choice of algorithm hyperparameters.
Researcher Affiliation | Academia | Jonathan H. Huggins, Trevor Campbell, Tamara Broderick; Computer Science and Artificial Intelligence Laboratory, MIT; {jhuggins@, tdjc@, tbroderick@csail.}mit.edu
Pseudocode | Yes | Algorithm 1: Construction of logistic regression coreset
Open Source Code | Yes | Code to recreate all of our experiments is available at https://bitbucket.org/jhhuggins/lrcoresets.
Open Datasets | Yes | The CHEMREACT dataset consists of N = 26,733 chemicals... The WEBSPAM corpus consists of N = 350,000 web pages... The cover type (COVTYPE) dataset consists of N = 581,012 cartographic observations... (Synthetic data generation refers to Scott et al. [21])
Dataset Splits | No | The paper specifies test set sizes (e.g., '10^3 additional data points were generated for testing' for synthetic data, and '2,500 (resp. 50,000 and 29,000) data points of the CHEMREACT (resp. WEBSPAM and COVTYPE) dataset were held out for testing' for real data), but does not explicitly provide information about a separate validation dataset split.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper states 'We implemented the logistic regression coreset algorithm in Python' but does not provide specific version numbers for Python or any other key software dependencies.
Experiment Setup | Yes | We ran adaptive MALA for 100,000 iterations on the full dataset and each subsampled dataset. For the synthetic datasets... we used k = 4 while for the real-world datasets... we used k = 6. We used a heuristic to choose R as large as was feasible... Our experiments used a = 3.
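
The Experiment Setup entry mentions running adaptive MALA on the full dataset and on each subsampled dataset. As a rough illustration only, the sketch below runs plain (non-adaptive) MALA with a fixed step size on a weighted logistic regression posterior, where the per-point weights stand in for coreset weights. The function names, the isotropic Gaussian prior, the step size, and the label convention y in {-1, +1} are assumptions made for this example; they are not taken from the paper or from its code repository.

```python
import numpy as np


def log_posterior_and_grad(theta, X, y, weights, prior_var=1.0):
    """Weighted logistic regression log posterior (up to a constant) and its gradient.

    Assumptions for this sketch: labels y are in {-1, +1} and the prior is an
    isotropic Gaussian with variance prior_var.
    """
    margins = y * (X @ theta)
    # log sigmoid(m) = -log(1 + exp(-m)), evaluated stably with logaddexp.
    log_lik = -np.sum(weights * np.logaddexp(0.0, -margins))
    log_prior = -0.5 * (theta @ theta) / prior_var
    # Gradient of log sigmoid(m) w.r.t. theta is sigmoid(-m) * y * x.
    sigmoid_neg = np.exp(-np.logaddexp(0.0, margins))
    grad = X.T @ (weights * sigmoid_neg * y) - theta / prior_var
    return log_lik + log_prior, grad


def mala(X, y, weights, n_iter=1000, step=0.01, seed=0):
    """Fixed-step MALA; returns the chain and the fraction of accepted proposals."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = np.zeros(d)
    logp, grad = log_posterior_and_grad(theta, X, y, weights)
    samples = np.empty((n_iter, d))
    n_accept = 0
    for t in range(n_iter):
        # Langevin proposal: gradient drift plus Gaussian noise with variance `step`.
        mean_fwd = theta + 0.5 * step * grad
        proposal = mean_fwd + np.sqrt(step) * rng.standard_normal(d)
        logp_prop, grad_prop = log_posterior_and_grad(proposal, X, y, weights)
        mean_rev = proposal + 0.5 * step * grad_prop
        # Log densities of the forward and reverse proposals (shared constants cancel).
        log_q_fwd = -np.sum((proposal - mean_fwd) ** 2) / (2.0 * step)
        log_q_rev = -np.sum((theta - mean_rev) ** 2) / (2.0 * step)
        log_alpha = logp_prop - logp + log_q_rev - log_q_fwd
        # Metropolis-Hastings accept/reject step.
        if np.log(rng.uniform()) < log_alpha:
            theta, logp, grad = proposal, logp_prop, grad_prop
            n_accept += 1
        samples[t] = theta
    return samples, n_accept / n_iter
```

Passing all-ones weights targets the full-data posterior. Passing the weights of a coreset or subsample, with X and y restricted to the selected points, targets the corresponding approximate posterior.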