Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

A Coreset Learning Reality Check

Authors: Fred Lu, Edward Raff, James Holt

AAAI 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling. ... In our experiment design we account for these limitations by benchmarking most known subsampling methods for logistic regression over a large variety of realistic datasets. We are the first to present a thorough empirical comparison of these approaches, over a range of important metrics.
Researcher Affiliation	Collaboration	Fred Lu (1, 2, 3), Edward Raff (1, 2, 3), James Holt (3). Affiliations: (1) Booz Allen Hamilton; (2) University of Maryland, Baltimore County; (3) Laboratory for Physical Sciences
Pseudocode	Yes	Algorithm 1 Coreset sampling procedure
Open Source Code	No	The paper discusses using source code from other authors ('Checking with the source code, we identified that the authors weighted the pilot sample') but does not state that the authors are providing their own implementation code for the methodology described in this paper, nor is a link provided.
Open Datasets	Yes	In our experiment design we account for these limitations by benchmarking most known subsampling methods for logistic regression over a large variety of realistic datasets. We evaluate on 8 datasets which include previously used ones as well as new ones. The sizes range from 24000 to nearly 5 million (Table 2). Table 2 lists: chemreact, census, bank, webspam, kddcup, covtype, bitcoin, SUSY. Additional details on dataset preprocessing and sources are in the Appendix.
Dataset Splits	Yes	Relative ROC of the subsampled model on validation data: $\mathrm{ROC}(\hat{\beta}_C, X, y) / \mathrm{ROC}(\hat{\beta}_{\mathrm{MLE}}, X, y)$... We replicate each procedure 50 times and report the medians and inter-quartile intervals.
Hardware Specification	No	The paper does not provide specific details about the hardware used to run its experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies	No	The paper mentions 'numpy.linalg.qr routine' and 'JASP software' but does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup	Yes	In our main experiments, we use weak L2 regularization at $\lambda = 10^{-5}$... We replicate each procedure 50 times and report the medians and inter-quartile intervals.
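
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (relative ROC of a subsampled model against the full-data fit, 50 replications, median and inter-quartile interval, L2 regularization at 10^-5) can be illustrated with a minimal sketch. This is not the authors' code, which the paper does not release. Everything here is an assumption made for illustration: the data are synthetic, the subsampler is plain uniform sampling (the baseline the paper compares against), "ROC" is taken to mean area under the ROC curve, the regularizer is assumed to enter as lam * ||beta||^2 added to the mean log-loss, and the names fit_logreg, auc, and relative_roc_experiment are hypothetical.

```python
import numpy as np

def fit_logreg(X, y, lam=1e-5, weights=None, iters=25):
    """Weighted L2-regularized logistic regression via Newton's method.

    Assumed objective: weighted mean log-loss + lam * ||beta||^2.
    """
    n, d = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the loss is a weighted mean
    beta = np.zeros(d)
    for _ in range(iters):
        z = np.clip(X @ beta, -30, 30)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (w * (p - y)) + 2 * lam * beta
        hess = (X * (w * p * (1 - p))[:, None]).T @ X + 2 * lam * np.eye(d)
        beta -= np.linalg.solve(hess, grad)
    return beta

def auc(scores, y):
    """Area under the ROC curve via the rank-sum statistic (ties not handled)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def relative_roc_experiment(X, y, X_val, y_val, m, reps=50, lam=1e-5, seed=0):
    """Median and inter-quartile interval of AUC(subsample fit) / AUC(full fit)."""
    rng = np.random.default_rng(seed)
    beta_full = fit_logreg(X, y, lam)
    auc_full = auc(X_val @ beta_full, y_val)
    ratios = []
    for _ in range(reps):
        # Uniform subsampling without replacement; all points weighted equally.
        idx = rng.choice(len(y), size=m, replace=False)
        beta_sub = fit_logreg(X[idx], y[idx], lam)
        ratios.append(auc(X_val @ beta_sub, y_val) / auc_full)
    q1, med, q3 = np.percentile(ratios, [25, 50, 75])
    return med, (q1, q3)

# Synthetic data (an assumption; the paper uses 8 real datasets).
rng = np.random.default_rng(1)
X_all = rng.normal(size=(12000, 5))
beta_true = rng.normal(size=5)
y_all = (rng.random(12000) < 1 / (1 + np.exp(-X_all @ beta_true))).astype(float)
X, y = X_all[:10000], y_all[:10000]
X_val, y_val = X_all[10000:], y_all[10000:]

median, (q1, q3) = relative_roc_experiment(X, y, X_val, y_val, m=500)
print(f"relative ROC: median={median:.4f}, IQR=[{q1:.4f}, {q3:.4f}]")
```

A coreset or optimal-subsampling method would replace the uniform draw with a non-uniform sampling distribution and pass the corresponding inverse-probability weights through the weights argument; the metric and replication loop would stay the same.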