Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
# Importance Sampling for Minibatches
Authors: Dominik Csiba, Peter Richtárik
JMLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate on synthetic problems that for training data of certain properties, our sampling can lead to several orders of magnitude improvement in training time. We then test the new sampling on several popular data sets, and show that the improvement can reach an order of magnitude. (Section 7, Experiments:) We now comment on the results of our numerical experiments, with both synthetic and real data sets. We plot the optimality gap P(w^(t)) − P(w*) and, in the case of real data, also the test error (vertical axis) against the computational effort (horizontal axis). |
| Researcher Affiliation | Academia | Dominik Csiba, EMAIL, School of Mathematics, University of Edinburgh, Edinburgh, United Kingdom. Peter Richtárik, EMAIL, School of Mathematics, University of Edinburgh, Edinburgh, United Kingdom. |
| Pseudocode | Yes | Algorithm 1: dfSDCA (Csiba and Richtárik, 2015) |
| Open Source Code | No | The paper does not provide explicit statements about releasing source code for the methodology described, nor does it include a link to a code repository. Footnote 3 refers to datasets, not code. |
| Open Datasets | Yes | We used several publicly available data sets (footnote 3), summarized in Table 5... Footnote 3: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ |
| Dataset Splits | Yes | We used several publicly available data sets (footnote 3), summarized in Table 5, which we randomly split into a train (80%) and a test (20%) part. |
| Hardware Specification | No | The paper does not provide specific hardware details like CPU or GPU models used for the experiments. It focuses on 'computational effort' as an implementation-independent model for time. |
| Software Dependencies | No | The paper mentions 'Julia' in Table 2 for creating artificial data (e.g., 'L = rand(chisq(1),n)'), but it does not specify a version number for Julia or any other software libraries used for implementing the algorithms. |
| Experiment Setup | Yes | In all experiments we used the logistic loss φ_i(z) = log(1 + e^(−y_i z)) and set the regularizer to λ = max_i ‖X_{:i}‖ / n. ... The values of τ we used to plot are τ ∈ {1, 2, 4, 8, 16, 32}. |
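The quoted setup can be sketched in code. The following is a minimal, hypothetical reconstruction (the paper's own implementation is in Julia and is not released): it builds toy data, performs the 80% / 20% split, sets λ = max_i ‖X_{:i}‖ / n as quoted, and evaluates the regularized logistic-loss objective. All variable names and the toy data are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: d features, n examples; column X[:, i] is the i-th data point.
d, n = 10, 100
X = rng.standard_normal((d, n))
y = rng.choice([-1.0, 1.0], size=n)

# Random 80% train / 20% test split, as described in the paper.
perm = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Regularizer from the quoted setup: lambda = max_i ||X_{:i}|| / n.
lam = np.max(np.linalg.norm(X, axis=0)) / n

def objective(w, X, y, lam):
    """P(w) = (1/n) sum_i log(1 + exp(-y_i w^T X_{:i})) + (lam/2) ||w||^2."""
    z = y * (w @ X)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (w @ w)

w0 = np.zeros(d)
gap_at_zero = objective(w0, X[:, train_idx], y[train_idx], lam)
# At w = 0 the logistic loss is log(2) per example and the regularizer is 0.
```

The optimality gap plotted in the paper, P(w^(t)) − P(w*), would then be this objective evaluated along the iterates minus its minimum value.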