Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Delegated Classification

Authors: Eden Saig, Inbal Talgam-Cohen, Nir Rosenfeld

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we demonstrate that budget-optimal contracts can be constructed using small-scale data, leveraging recent advances in the study of learning curves and scaling laws. Performance and economic outcomes are evaluated using synthetic and real-world classification tasks. 4 Experiments
Researcher Affiliation	Academia	Eden Saig, Inbal Talgam-Cohen, Nir Rosenfeld Technion Israel Institute of Technology Haifa, Israel EMAIL
Pseudocode	No	The paper describes the 'single binding action (SBA) algorithm' in text but does not include a structured pseudocode block or algorithm box.
Open Source Code	Yes	Code is available at: https://github.com/edensaig/delegated-classification.
Open Datasets	Yes	We base our experiments on the recently curated Learning Curves Database (LCDB) [43], which includes a large collection of stochastic learning curves for multiple classification datasets and methods. Here we focus primarily on the popular MNIST dataset [39] as our case study...
Dataset Splits	Yes	expected performance is estimated by the empirical average on an additional held-out validation set $V \sim D^m$ of size $m$, as $\mathrm{acc}_V(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) = y_i]$, which is a consistent and unbiased estimator of $\mathrm{acc}_D(h)$. For each trained classifier, each accuracy point on the learning curve is estimated using 5,000 held-out samples.
Hardware Specification	Yes	All experiments were run on a single laptop, with 16GB of RAM, M1 Pro processor, and with no GPU support.
Software Dependencies	No	The paper mentions software like Pyomo, GLPK, and scikit-learn, but does not specify their version numbers for reproducibility.
Experiment Setup	Yes	action costs are set to fixed per-unit cost, i.e., $c_n = n$; and (iii) the distribution $F$ over outcomes $\Omega$ is associated with a binomial mixture distribution, resulting from applying bootstrap sampling to empirical error measurements: $\frac{1}{R} \sum_{r=1}^{R} \mathrm{Binomial}(m, a_n^{r,\mathrm{Alg},D})$. In particular, we experiment with fitting parametric power-law curves of the form $E[\alpha_n] = a - b n^{-c}$, which have been shown to provide good fit in various scenarios both empirically and theoretically [49, 34, 11]. We define $r$ as the number of samples per $n$ (so low $r$ means larger $n_0$). Then, for a given $r$, we set $n_0$ such that $\sum_{n \le n_0} r \cdot n \le k$ (i.e., such that the total number of used samples does not exceed $k$).
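
The quantities quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It is not taken from the paper's repository: the function names (heldout_accuracy, power_law_accuracy, sample_binomial_mixture) and all numeric inputs are hypothetical. It assumes the power-law curve takes the common form a - b * n**(-c) and that the mixture weights its R bootstrap replicates uniformly. Only the Python standard library is used.

```python
import random


def heldout_accuracy(predictions, labels):
    """Empirical accuracy: fraction of held-out samples with h(x_i) == y_i."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def power_law_accuracy(n, a, b, c):
    """Expected accuracy at training-set size n under the assumed form a - b * n**(-c)."""
    return a - b * n ** (-c)


def sample_binomial_mixture(accuracies, m, rng):
    """Draw one outcome from the uniform mixture (1/R) * sum_r Binomial(m, a_r).

    `accuracies` holds one bootstrap accuracy estimate per replicate r
    (hypothetical inputs); `m` is the held-out set size.
    """
    a_r = rng.choice(accuracies)  # pick one replicate uniformly at random
    # Binomial(m, a_r) drawn as a sum of m Bernoulli(a_r) trials.
    return sum(1 for _ in range(m) if rng.random() < a_r)


if __name__ == "__main__":
    rng = random.Random(0)
    # 3 of 4 predictions match the labels.
    print(heldout_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    # Number of correct predictions out of m = 5000 held-out samples.
    print(sample_binomial_mixture([0.90, 0.92, 0.91], m=5000, rng=rng))
```

Drawing the binomial as a sum of Bernoulli trials keeps the sketch free of dependencies; for larger experiments a vectorized sampler would be the more practical choice.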