Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Approximate Cross-Validation with Low-Rank Data in High Dimensions
Authors: Will Stephenson, Madeleine Udell, Tamara Broderick
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present numerical experiments that confirm our theoretical predictions and demonstrate the effectiveness of ACV in a range of high-dimensional settings. |
| Researcher Affiliation | Academia | 1 Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213 2 Department of Mathematics and Statistics, McMaster University, Hamilton, ON L8S 4K1, Canada 3 Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 |
| Pseudocode | No | The paper describes methods in prose and mathematical formulations but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We also consider publicly available scRNA-seq data from [30]. |
| Dataset Splits | Yes | For each simulation, we generate N = 1000 training samples and Ntest = 100 test samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper states "All numerical experiments were performed using Python 3.9." but does not list specific version numbers for other key libraries or dependencies like scikit-learn that are mentioned. Python 3.9 by itself is not sufficient according to the criteria. |
| Experiment Setup | Yes | We consider two types of synthetic data: Ridge regression and Logistic regression, both with a low-rank feature matrix. For each simulation, we generate N = 1000 training samples and Ntest = 100 test samples, with p covariates (p = 200 or p = 2000) and rank r (r = 1, 5, 10, 20). The noise level is set to σ = 0.1, 0.5, 1.0. For the iterative solver, we set the maximum number of iterations max_iter = 1000 and tolerance tol = 1e-6. The regularization parameter λ is chosen using 5-fold CV on the training data. |
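
The Experiment Setup row lists sample sizes, dimensions, ranks, noise levels, and 5-fold CV for λ, but not how the data are generated. The sketch below is one possible reading of that setup for the ridge case, not the authors' code. The helper names (`make_low_rank_data`, `ridge_fit`, `choose_lambda_kfold`), the Gaussian factor construction `X = U Vᵀ`, the coefficient scaling, and the λ grid are all assumptions introduced for illustration.

```python
import numpy as np

def make_low_rank_data(n_train=1000, n_test=100, p=200, r=5, sigma=0.1, seed=0):
    """Hypothetical generator: rank-r Gaussian features with a noisy linear response."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    # Assumed construction: X = U V^T has rank at most r by design.
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((p, r))
    X = U @ V.T
    theta_true = rng.standard_normal(p) / np.sqrt(p)  # assumed scaling
    y = X @ theta_true + sigma * rng.standard_normal(n)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) theta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def choose_lambda_kfold(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the lowest mean validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mean_errors = []
    for lam in lambdas:
        fold_errors = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            theta = ridge_fit(X[train], y[train], lam)
            fold_errors.append(np.mean((X[val] @ theta - y[val]) ** 2))
        mean_errors.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(mean_errors))]

# One configuration from the reported grid: p = 200, r = 5, sigma = 0.1.
X_train, y_train, X_test, y_test = make_low_rank_data(p=200, r=5, sigma=0.1)
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]  # assumed grid; the source gives none
best_lam = choose_lambda_kfold(X_train, y_train, lambdas, k=5)
theta_hat = ridge_fit(X_train, y_train, best_lam)
test_mse = float(np.mean((X_test @ theta_hat - y_test) ** 2))
print(f"selected lambda = {best_lam}, test MSE = {test_mse:.4f}")
```

Looping over the reported values of p, r, and σ would cover the full grid. The logistic regression case and the `max_iter` / `tol` solver settings are not covered here, since the closed-form ridge solution needs no iterative solver.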