Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PAC Learning with Improvements

Authors: Idan Attias, Avrim Blum, Keziah Naggita, Donya Saless, Dravyansh Sharma, Matthew Walter

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In Section 6, we conduct experiments on three real-world and one fully synthetic binary classification tabular datasets to investigate how the error rate of a model function (h) decreases when test-set agents that it initially classified as negative improve. Our results indicate that while risk-averse models may start with higher error rates, their errors rapidly drop as the negatively classified test agents improve and the improvement budget (r) increases.
Researcher Affiliation	Academia	Idan Attias (1, 2), Avrim Blum (2), Keziah Naggita (2), Donya Saless (2), Dravyansh Sharma (2, 3), Matthew Walter (2). (1) University of Illinois at Chicago; (2) Toyota Technological Institute at Chicago; (3) Northwestern University. Correspondence to: Keziah Naggita <EMAIL>, Donya Saless <EMAIL>.
Pseudocode	Yes	Below are the steps of the improvement algorithm we used to compute each agent's improvement features. Initialization: $x^{(0)} = x_{\text{orig}}$. Iterative updates: For $t = 0, 1, \ldots, T-1$: 1. Compute the gradient of the loss $L$ with respect to the agent's updates $x^{(t)}$: $g^{(t)} = \nabla_{x^{(t)}} L\big(h(x^{(t)}), h(x_{\text{orig}})\big)$. 2. Update the improvement features by taking a step in the direction of the sign of the gradient: $\rho^{(t)}[i] = \alpha \, \mathrm{sign}(g^{(t)}[i])$ if $i \in S$, and $0$ otherwise, for $i \in [d]$; $x^{(t+1)} = x^{(t)} + \rho^{(t)}$. 3. Project the updated improvement features back onto the $r$-ball around the original features $x_{\text{orig}}$: $x^{(t+1)} = x_{\text{orig}} + \mathrm{clip}_{[-r, r]}(x^{(t+1)} - x_{\text{orig}})$. Improvement vector: After $T$ iterations, the final agent's improvement is given by:
Open Source Code	Yes	Our code is publicly available here.
Open Datasets	Yes	We use three real-world tabular datasets: the Adult UCI dataset (Becker & Kohavi, 1996), the OULAD and Law School datasets (Le Quy et al., 2022a), and a synthetic 8-dimensional binary classification dataset with class separability 4 and minimal outliers, generated using Scikit-learn's make_classification function (Pedregosa et al., 2011). In each case we train a zero-error model f on the entire dataset, which we treat as the true labeling function for our experiments. Let $S_T = \{(x, y) \mid x \in \mathbb{R}^d, y \in \{0, 1\}\}$ represent the dataset (e.g., Adult), where x is the feature vector and y = f(x) is the label. For all experiments, we split $S_T$ into training $S_{\text{train}}$ (70%) and testing $S_{\text{test}}$ (30%) subsets. Further dataset details, including improvement features and class distributions, are provided in Appendix E.1.
Dataset Splits	Yes	For all experiments, we split $S_T$ into training $S_{\text{train}}$ (70%) and testing $S_{\text{test}}$ (30%) subsets.
Hardware Specification	Yes	All experiments were conducted on a laptop computer with the following hardware specifications: 2.6-GHz 6-Core Intel Core i7 processor, 16 GB of 2400-MHz DDR4 RAM, and an Intel UHD Graphics 630 graphics card with 1536 MB of memory.
Software Dependencies	No	We trained two-layer neural networks, denoted as h functions, using PyTorch with the Adam optimizer, a learning rate of 0.001, and a batch size of 64. These h functions generate decisions for the test-set agents. In cases where the test agent receives a negative classification, they can, if within budget, improve their feature values to get the desired classification from the h function.
Experiment Setup	Yes	We trained two-layer neural networks, denoted as h functions, using PyTorch with the Adam optimizer, a learning rate of 0.001, and a batch size of 64. These h functions generate decisions for the test-set agents. In cases where the test agent receives a negative classification, they can, if within budget, improve their feature values to get the desired classification from the h function. Table 4 summarizes the performance metrics of the f and h model functions, demonstrating their varied performance across the datasets. Since the empirical setup evaluates the impact of improvement on h's error drop rates, we vary the loss functions we train the model h function with. We use the standard binary cross entropy loss (BCE) and the risk-averse weighted-BCE (wBCE) loss functions defined in Equation 5.
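
The update loop quoted in the Pseudocode row (sign-gradient step on the improvable features S, then projection onto the r-ball around the original features) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a hypothetical logistic model in place of the paper's two-layer network, and it assumes the loss L is binary cross entropy against the hard label that h assigns to the original features. The names `predict_proba`, `improve`, `w`, and `b` are illustrative only; `S`, `r`, `alpha`, and `T` follow the symbols in the quoted steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    # Hypothetical stand-in for h: a logistic model, NOT the paper's network.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def improve(x_orig, w, b, S, r, alpha, T):
    """Sign-gradient improvement with projection onto the r-ball around x_orig."""
    # Assumption: the loss target is the hard label h assigns to x_orig.
    y_orig = 1.0 if predict_proba(x_orig, w, b) >= 0.5 else 0.0
    x = list(x_orig)
    for _ in range(T):
        p = predict_proba(x, w, b)
        # Step 1: gradient of BCE(p, y_orig) w.r.t. x for a logistic model.
        g = [(p - y_orig) * wi for wi in w]
        # Step 2: move only the improvable features (indices in S).
        x = [xi + (alpha * sign(gi) if i in S else 0.0)
             for i, (xi, gi) in enumerate(zip(x, g))]
        # Step 3: clip each coordinate's change to [-r, r].
        x = [xo + max(-r, min(r, xi - xo)) for xo, xi in zip(x_orig, x)]
    return x


# Toy agent: three features, of which only features 0 and 1 are improvable.
w, b = [2.0, -1.0, 0.5], -1.0
x_orig = [0.0, 0.0, 0.0]
x_new = improve(x_orig, w, b, S={0, 1}, r=0.5, alpha=0.2, T=5)
print(predict_proba(x_orig, w, b) >= 0.5, predict_proba(x_new, w, b) >= 0.5)
```

In this toy setting the agent starts with a negative classification and, within budget r = 0.5, moves its two improvable features far enough to receive a positive one, while the non-improvable feature stays fixed. With a neural network h, the analytic gradient in step 1 would be replaced by automatic differentiation.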