Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Convex and Non-Convex Approaches for Statistical Inference with Class-Conditional Noisy Labels

Authors: Hyebin Song, Ran Dai, Garvesh Raskutti, Rina Foygel Barber

JMLR 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our theoretical findings through simulations and a real-data example. In Sections 6 and 7, we apply convex and non-convex methods to synthetic and real data and compare the performance of the two estimators.
Researcher Affiliation | Academia | Hyebin Song (EMAIL), Department of Statistics, The Pennsylvania State University, State College, PA 16801, USA; Ran Dai (EMAIL), Department of Biostatistics, University of Nebraska Medical Center, Omaha, NE 68198, USA; Garvesh Raskutti (EMAIL), Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA; Rina Foygel Barber (EMAIL), Department of Statistics, University of Chicago, Chicago, IL 60637, USA
Pseudocode | No | The paper describes its methods in prose and mathematical formulations, without explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not state or link to a public release of the source code for its methodology; it refers only to data availability and to third-party R packages.
Open Datasets | Yes | The data set we analyze is a positive and unlabeled beta-glucosidase protein sequence data set generated in the Romero Lab (Romero et al., 2015). Large-scale data were generated by the deep mutational scanning (DMS) method... The raw data are available at https://github.com/RomeroLab/seq-fcn-data.git
Dataset Splits | Yes | In addition, to compare the predictive performance of the two methods, we split the data set into training and test sets using 90% and 10% of the sequence examples, respectively.
Hardware Specification | No | The paper gives no details about the hardware used for the experiments, such as CPU or GPU models or the computing environment.
Software Dependencies | No | For the estimators β̂_ref and β̂H_ref, the paper used the glm() function from base R and glmnet() from the R package glmnet, respectively. (No version numbers are given for R or the glmnet package.)
Experiment Setup | Yes | For non-convex problems, we initialize coefficients at the null model β = [0, . . . , 0] if the problem is in the low-dimensional regime, and use a local initialization from a convex estimate otherwise. For optimization, we use the proximal gradient method combined with a backtracking line search to solve the optimization problems in (16) and (21). The tuning parameter λ must be chosen for the high-dimensional estimators; we choose λ in each simulation based on the test loss from 5-fold cross-validation. The model is then refitted using 0.1%, 1%, 10%, and 100% of the examples in the training set to compare the performance of the two methods at various sample sizes.
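The Experiment Setup row describes proximal gradient descent with a backtracking line search. As a rough illustration of that optimization scheme only, here is a minimal Python sketch for a generic ℓ1-penalized logistic loss; the function names are my own, and this is not the paper's exact objectives (16) or (21), which involve the class-conditional noise model:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_loss_grad(beta, X, y):
    """Average logistic loss and gradient for labels y in {0, 1}."""
    z = X @ beta
    p = 0.5 * (1.0 + np.tanh(z / 2.0))  # numerically stable sigmoid
    # log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|))
    loss = np.mean(np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))) - y * z)
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def proximal_gradient(X, y, lam, beta0=None, step=1.0, shrink=0.5,
                      max_iter=500, tol=1e-8):
    """Minimize logistic_loss(beta) + lam * ||beta||_1 by proximal
    gradient with backtracking line search on the step size."""
    n, d = X.shape
    beta = np.zeros(d) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        loss, grad = logistic_loss_grad(beta, X, y)
        t = step
        while True:
            # gradient step on the smooth part, prox step on the penalty
            beta_new = soft_threshold(beta - t * grad, t * lam)
            diff = beta_new - beta
            new_loss, _ = logistic_loss_grad(beta_new, X, y)
            # backtracking: shrink t until the quadratic upper bound holds
            if new_loss <= loss + grad @ diff + (diff @ diff) / (2.0 * t):
                break
            t *= shrink
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

In the paper's workflow, λ would then be selected per simulation by 5-fold cross-validation over a grid, refitting on each training fold and scoring the held-out loss; starting from the null model β = 0 matches the low-dimensional initialization described above.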