On Coresets for Logistic Regression
Authors: Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, David P. Woodruff
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression. |
| Researcher Affiliation | Academia | Alexander Munteanu, Department of Computer Science, TU Dortmund University, 44227 Dortmund, Germany, alexander.munteanu@tu-dortmund.de; Chris Schwiegelshohn, Department of Computer Science, Sapienza University of Rome, 00185 Rome, Italy, schwiegelshohn@diag.uniroma1.it; Christian Sohler, Department of Computer Science, TU Dortmund University, 44227 Dortmund, Germany, christian.sohler@tu-dortmund.de; David P. Woodruff, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA, dwoodruf@cs.cmu.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states, 'We implemented our algorithms in Python.' and refers to using 'parts of their original Python code' for a baseline method and 'the standard gradient based optimizer from the scipy.optimize package', but it does not provide a link or explicit statement about making their own code open source for the method described in the paper. |
| Open Datasets | Yes | The WEBB SPAM data consists of 350,000 unigrams with 127 features from web pages [...]. The COVERTYPE data consists of 581,012 cartographic observations of different forests with 54 features. [...] The KDD CUP 99 data comprises 494,021 network connections with 41 features [...]. Footnotes provide URLs for these datasets: https://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html, https://archive.ics.uci.edu/ml/datasets/covertype, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test splits. It mentions 'assessed the total running times for computing the sampling probabilities, sampling and optimizing on the sample' and 'subsampling algorithms for a number of thirty regular subsampling steps in the range k ∈ [2√n, n/16]', but no specific split ratios or methodology for dividing the data into training, validation, or test sets. |
| Hardware Specification | Yes | All experiments were run on a Linux machine using an Intel i7-6700, 4 core CPU at 3.4 GHz, and 32GB of RAM. |
| Software Dependencies | No | The paper states, 'We implemented our algorithms in Python.' and 'The subsequent optimization was done for all approaches with the standard gradient based optimizer from the scipy.optimize package, see http://www.scipy.org/.' However, it does not provide specific version numbers for Python or the scipy.optimize package. |
| Experiment Setup | Yes | For each data set, we ran all three subsampling algorithms for a number of thirty regular subsampling steps in the range k ∈ [2√n, n/16]. For each step, we present the mean relative error as well as the trade-off between mean relative error and running time, taken over twenty independent repetitions, in Figure 1. [...] The approach of [27] is based on a k-means++ clustering [3] on a small uniform sample of the data and was performed using standard parameters taken from the publication. [...] The subsequent optimization was done for all approaches with the standard gradient based optimizer from the scipy.optimize package. (A minimal sketch of this evaluation loop appears below the table.) |
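
The experiment loop summarized in the last row (subsample, reweight, optimize with scipy.optimize, compare against the full-data optimum) can be sketched as follows. This is a minimal reconstruction under assumptions, not the authors' code: the synthetic data stand-in, the exact loss formulation, the L-BFGS-B method choice, and the uniform reweighting scheme are illustrative guesses; only the use of scipy.optimize and the relative-error comparison are stated in the paper.

```python
# Hedged sketch of the uniform-subsampling baseline described in the Experiment Setup row.
import numpy as np
from scipy.optimize import minimize

def logistic_loss(beta, X, y, weights=None):
    """(Weighted) logistic loss: sum of log(1 + exp(-y_i * x_i @ beta))."""
    margins = -y * (X @ beta)
    losses = np.logaddexp(0.0, margins)  # numerically stable log(1 + exp(.))
    return losses.sum() if weights is None else (weights * losses).sum()

def fit(X, y, weights=None):
    """Minimize the logistic loss with a standard scipy.optimize gradient-based solver."""
    d = X.shape[1]
    res = minimize(logistic_loss, x0=np.zeros(d), args=(X, y, weights), method="L-BFGS-B")
    return res.x

# Toy data standing in for WEBB SPAM / COVERTYPE / KDD CUP 99 (labels in {-1, +1}).
rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

beta_full = fit(X, y)                       # optimum on the full data
full_loss = logistic_loss(beta_full, X, y)

# One subsampling step at sample size k; the coreset methods would replace the
# uniform probabilities with sensitivity-based ones.
k = n // 16
idx = rng.choice(n, size=k, replace=False)
w = np.full(k, n / k)                       # reweight so the sample estimates the full loss
beta_sub = fit(X[idx], y[idx], weights=w)

# Relative error of the subsampled solution, evaluated on the full data.
rel_error = abs(logistic_loss(beta_sub, X, y) - full_loss) / full_loss
print(f"relative error of subsampled solution: {rel_error:.4f}")
```

In the paper this inner step would be repeated over thirty sample sizes and twenty independent repetitions per size, reporting the mean relative error and running time for each method.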