Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Convex and Non-Convex Approaches for Statistical Inference with Class-Conditional Noisy Labels

Authors: Hyebin Song, Ran Dai, Garvesh Raskutti, Rina Foygel Barber

JMLR 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our theoretical findings through simulations and a real-data example. In Sections 6 and 7, we apply convex and non-convex methods to synthetic and real data and compare the performance of the two estimators.
Researcher Affiliation | Academia | Hyebin Song (EMAIL), Department of Statistics, The Pennsylvania State University, State College, PA 16801, USA; Ran Dai (EMAIL), Department of Biostatistics, University of Nebraska Medical Center, Omaha, NE 68198, USA; Garvesh Raskutti (EMAIL), Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; Rina Foygel Barber (EMAIL), Department of Statistics, University of Chicago, Chicago, IL 60637, USA
Pseudocode | No | The paper describes its methods in prose and mathematical formulations, without explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not state or link to a public release of the source code for its methodology; it refers only to data availability and to third-party R packages.
Open Datasets | Yes | The data set we analyze is a positive and unlabeled beta-glucosidase protein sequence data set generated in the Romero Lab (Romero et al., 2015). Large-scale data were generated by the deep mutational scanning (DMS) method... The raw data are available at https://github.com/RomeroLab/seq-fcn-data.git
Dataset Splits | Yes | In addition, to compare the predictive performance of the two methods, we split the data set into training and test sets using 90% and 10% of the sequence examples, respectively.
Hardware Specification | No | The paper gives no details about the hardware used for the experiments, such as CPU or GPU models or the computing environment.
Software Dependencies | No | For the estimators β̂_ref and β̂H_ref, the paper used the glm() function from base R and glmnet() from the R package glmnet, respectively. (No version numbers are given for R or the glmnet package.)
Experiment Setup | Yes | For non-convex problems, we initialize coefficients at the null model β = [0, . . . , 0] if the problem is in the low-dimensional regime, and use a local initialization from a convex estimate otherwise. For optimization, we use the proximal gradient method combined with a backtracking line search to solve the optimization problems in (16) and (21). The tuning parameter λ must be chosen for the high-dimensional estimators; we choose λ in each simulation based on the test loss from 5-fold cross-validation. The model is then refitted using 0.1%, 1%, 10%, and 100% of the examples in the training set to compare the performance of the two methods at various sample sizes.
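The Experiment Setup row describes proximal gradient descent with a backtracking line search. As a rough illustration of that optimization scheme only, here is a minimal Python sketch for a generic ℓ1-penalized logistic loss; the function names are my own, and this is not the paper's exact objectives (16) or (21), which involve the class-conditional noise model:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_loss_grad(beta, X, y):
    """Average logistic loss and gradient for labels y in {0, 1}."""
    z = X @ beta
    p = 0.5 * (1.0 + np.tanh(z / 2.0))  # numerically stable sigmoid
    # log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|))
    loss = np.mean(np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))) - y * z)
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def proximal_gradient(X, y, lam, beta0=None, step=1.0, shrink=0.5,
                      max_iter=500, tol=1e-8):
    """Minimize logistic_loss(beta) + lam * ||beta||_1 by proximal
    gradient with backtracking line search on the step size."""
    n, d = X.shape
    beta = np.zeros(d) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        loss, grad = logistic_loss_grad(beta, X, y)
        t = step
        while True:
            # gradient step on the smooth part, prox step on the penalty
            beta_new = soft_threshold(beta - t * grad, t * lam)
            diff = beta_new - beta
            new_loss, _ = logistic_loss_grad(beta_new, X, y)
            # backtracking: shrink t until the quadratic upper bound holds
            if new_loss <= loss + grad @ diff + (diff @ diff) / (2.0 * t):
                break
            t *= shrink
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

In the paper's workflow, λ would then be selected per simulation by 5-fold cross-validation over a grid, refitting on each training fold and scoring the held-out loss; starting from the null model β = 0 matches the low-dimensional initialization described above.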