Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation

Authors: Jin-Hong Du, Pratik Patil, Arun K. Kuchibhotla

ICML 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	"Figure 2 shows both the GCV estimate and the asymptotic risk for the full ridge ensemble. We observe a close match of the theoretical curves and the GCV estimates." and "In Figure 3, we numerically compare the optimal subsampled ridgeless ensemble with the optimal ridge predictor to verify Corollary 3.2. As we can see, their theoretical curves exactly match, and the empirical estimates in finite samples are also close to their asymptotic limits." and "We compare tuning subsample size in the full ridgeless ensemble with tuning the ridge parameter on the full data in a real-world data example from multiomics."
Researcher Affiliation	Academia	1Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 2Department of Statistics, University of California, Berkeley, CA 94720, USA.
Pseudocode	No	The paper describes methods using mathematical formulations and textual descriptions but does not include any structured pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology.
Open Datasets	Yes	This single-cell CITE-seq dataset from Hao et al. (2021) consists of 50,781 human peripheral blood mononuclear cells (PBMCs)...
Dataset Splits	Yes	We randomly hold out half of the cells in each cell type as a test set.
Hardware Specification	No	The paper does not specify any particular hardware components (e.g., GPU models, CPU models, memory sizes) used for running the experiments.
Software Dependencies	No	The paper mentions software like the 'glmnet' package but does not provide specific version numbers for any software dependencies required for reproducibility.
Experiment Setup	Yes	For the former, we search over the grid of 25 k's from n^ν to n spaced evenly on the log scale, with ν = 0.5 and sample size n ranges from 516 to 7864 for different cell types. For the latter, we search over the grid of 100 λ's from 10^-2 to 10^2 spaced evenly on log scale.
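
The Dataset Splits and Experiment Setup rows above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code (the table notes that none was released). The function names, the random seed, rounding subsample sizes to integers, dropping duplicate sizes, and flooring the held-out count for cell types with an odd number of cells are all assumptions made here. The λ range is taken to be 10^-2 to 10^2, which is also an assumption about the extracted text.

```python
import numpy as np


def subsample_size_grid(n, nu=0.5, num=25):
    """Subsample sizes k from n**nu to n, evenly spaced on the log scale.

    Rounding to integers and dropping duplicates are assumptions of this
    sketch; the quoted setup only specifies the endpoints and the count.
    """
    ks = np.geomspace(n ** nu, n, num=num)
    return np.unique(np.round(ks).astype(int))


def ridge_penalty_grid(low=1e-2, high=1e2, num=100):
    """Ridge penalties lambda, evenly spaced on the log scale.

    The default endpoints are an assumed reading of the extracted text.
    """
    return np.geomspace(low, high, num=num)


def holdout_half_per_type(cell_types, seed=0):
    """Boolean test mask that holds out half of the cells in each cell type.

    For a type with an odd number of cells, the held-out count is rounded
    down (an assumption; the quoted text does not say).
    """
    rng = np.random.default_rng(seed)
    cell_types = np.asarray(cell_types)
    test_mask = np.zeros(cell_types.shape[0], dtype=bool)
    for t in np.unique(cell_types):
        idx = np.flatnonzero(cell_types == t)
        chosen = rng.choice(idx, size=idx.size // 2, replace=False)
        test_mask[chosen] = True
    return test_mask


if __name__ == "__main__":
    # Smallest per-type sample size reported in the table.
    print(subsample_size_grid(516))
    print(ridge_penalty_grid()[:3])
```

Each grid would be built once per cell type, since n differs across types (516 to 7864 according to the quoted setup).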