reproducibilityindex.ai

Coresets for Scalable Bayesian Logistic Regression

Authors: Jonathan Huggins, Trevor Campbell, Tamara Broderick

NeurIPS 2016

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluated the performance of the logistic regression coreset algorithm on a number of synthetic and real-world datasets. Experiments on a variety of synthetic and real-world datasets validate our approach and demonstrate robustness to the choice of algorithm hyperparameters.
Researcher Affiliation	Academia	Jonathan H. Huggins, Trevor Campbell, Tamara Broderick; Computer Science and Artificial Intelligence Laboratory, MIT; {jhuggins@, tdjc@, tbroderick@csail.}mit.edu
Pseudocode	Yes	Algorithm 1 Construction of logistic regression coreset
Open Source Code	Yes	Code to recreate all of our experiments is available at https://bitbucket.org/jhhuggins/lrcoresets.
Open Datasets	Yes	The CHEMREACT dataset consists of N = 26,733 chemicals... The WEBSPAM corpus consists of N = 350,000 web pages... The cover type (COVTYPE) dataset consists of N = 581,012 cartographic observations... (Synthetic data generation refers to Scott et al. [21])
Dataset Splits	No	The paper specifies test set sizes (e.g., '10^3 additional data points were generated for testing' for synthetic data, and '2,500 (resp. 50,000 and 29,000) data points of the CHEMREACT (resp. WEBSPAM and COVTYPE) dataset were held out for testing' for real data), but does not explicitly provide information about a separate validation dataset split.
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies	No	The paper states 'We implemented the logistic regression coreset algorithm in Python' but does not provide specific version numbers for Python or any other key software dependencies.
Experiment Setup	Yes	We ran adaptive MALA for 100,000 iterations on the full dataset and each subsampled dataset. For the synthetic datasets... we used k = 4 while for the real-world datasets... we used k = 6. We used a heuristic to choose R as large as was feasible... Our experiments used a = 3.
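
The Pseudocode and Experiment Setup rows refer to a coreset built by importance sampling. The sampling and reweighting step common to sensitivity-based coreset constructions can be sketched as below. This is a minimal sketch, not the authors' implementation: the per-point sensitivity bounds are taken as given input (in the paper they depend on the hyperparameters k and R listed above, and that computation is not reproduced here), and the function name `sample_coreset`, the parameter `coreset_size`, and the `seed` argument are hypothetical.

```python
import random
from collections import Counter


def sample_coreset(sensitivities, coreset_size, seed=0):
    """Importance-sample a weighted subset from per-point sensitivity bounds.

    sensitivities: positive bounds m_n, one per data point (assumed given).
    coreset_size:  number of multinomial draws M (hypothetical name).
    Returns a dict {index: weight} for the points drawn at least once.
    """
    rng = random.Random(seed)
    total = sum(sensitivities)
    # Sampling probability p_n is proportional to the sensitivity bound m_n.
    probs = [m / total for m in sensitivities]
    draws = rng.choices(range(len(probs)), weights=probs, k=coreset_size)
    counts = Counter(draws)  # K_n: how many times point n was drawn
    # Weight gamma_n = K_n / (M * p_n), so the weighted log-likelihood
    # is an unbiased estimate of the full-data log-likelihood.
    return {n: c / (coreset_size * probs[n]) for n, c in counts.items()}


# With uniform bounds this reduces to uniform subsampling with weights N / M.
weights = sample_coreset([1.0] * 1000, coreset_size=100)
```

The returned weights would then multiply each retained point's log-likelihood term when running a sampler such as MALA on the coreset instead of the full dataset.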