A Provably Accurate Randomized Sampling Algorithm for Logistic Regression

Authors: Agniva Chowdhury, Pradeep Ramuhalli

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets."
Researcher Affiliation | Academia | Agniva Chowdhury (1), Pradeep Ramuhalli (2). (1) Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN, USA; (2) Nuclear Energy and Fuel Cycle Division, Oak Ridge National Laboratory, TN, USA. {chowdhurya, ramuhallip}@ornl.gov
Pseudocode | Yes | Algorithm 1: Construct S...; Algorithm 2: Sketched logistic regression
Open Source Code | No | The paper includes a footnote link to an appendix containing the proofs, but no statement or link for the source code of the methodology. The footnote URL, "https://arxiv.org/abs/2402.16326", points to the arXiv version of the paper, not to source code.
Open Datasets | Yes | Three public datasets are used. The first, the Cardiovascular disease dataset from Kaggle (Halder 2020), contains 70,000 patient records (12 features) with a 50% positive-case rate; the task is predicting the presence of cardiovascular disease. The second, also from Kaggle, is the Bank customer churn prediction dataset, with 10,000 records (10 features) and a 20% positive-case rate, classifying the likelihood of customer departure. The third, the Default of credit card clients dataset, is sourced from the UCI ML Repository (Yeh 2016); it consists of 30,000 records (24 features) with a 22% positive-case rate and aims to predict the probability of future credit card default.
Dataset Splits | No | The paper reports experiments on the datasets and compares performance, but it specifies no training/validation/test splits or cross-validation strategy; it focuses on varying the sample size s of the subsampling algorithm.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions using the numpy.linalg.svd routine but does not specify any software with version numbers for reproducibility (e.g., Python version, NumPy version, other libraries).
Experiment Setup | No | The paper does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings. It focuses on the sampling methodology.
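The pseudocode entries above (Algorithm 1: Construct S; Algorithm 2: Sketched logistic regression) together with the mention of the numpy.linalg.svd routine suggest a row-sampling scheme of the sketch-and-solve kind. The snippet below is a minimal, hypothetical sketch of such a pipeline, not the authors' actual algorithm: the use of exact leverage scores as sampling probabilities, the function names construct_S and sketched_logreg, the plain gradient-descent solver, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def construct_S(X, s, rng):
    """Sample s rows of X with probability proportional to their row
    leverage scores (computed via a thin SVD, as the paper's use of
    numpy.linalg.svd hints), returning sampled indices and the usual
    1/(s*p_i) importance-sampling weights. Hypothetical reading of
    the paper's Algorithm 1."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U**2, axis=1)            # row leverage scores; sum to rank(X)
    p = lev / lev.sum()                   # sampling distribution over rows
    idx = rng.choice(X.shape[0], size=s, replace=True, p=p)
    w = 1.0 / (s * p[idx])                # rescaling weights for unbiasedness
    return idx, w

def sketched_logreg(X, y, s, n_iter=300, lr=1.0, seed=0):
    """Fit logistic regression on the weighted subsample (a sketch of
    Algorithm 2, solved here with plain gradient descent)."""
    rng = np.random.default_rng(seed)
    idx, w = construct_S(X, s, rng)
    Xs, ys, n = X[idx], y[idx], X.shape[0]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-Xs @ beta))   # predicted probabilities
        grad = Xs.T @ (w * (q - ys)) / n       # unbiased estimate of the mean gradient
        beta -= lr * grad
    return beta

# Tiny synthetic check (illustrative data, not one of the paper's datasets):
# recover a planted coefficient vector from a 20% subsample.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 1.0])
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = sketched_logreg(X, y, s=400)
acc = np.mean((X @ beta_hat > 0) == (y > 0.5))
```

Fitting on the s = 400 weighted rows rather than all 2,000 keeps each gradient step O(s·d) instead of O(n·d), which is the computational saving the paper's sampling approach targets.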