Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
The Human-AI Substitution game: active learning from a strategic labeler
Authors: Tom Yan, Chicheng Zhang
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | I EXPERIMENTS To supplement our theoretical minimax analysis in the main section, we examine the performance of three learning algorithms, E-VS bisection, VS-bisection and randomly query (a point), in averagecase settings by randomly generating learning instances. Experiment Setup: We consider five sizes for the hypothesis class ranging from 15 to 40. Given a particular hypothesis class size |H|, we generate 50 random learning instances by randomly generating the binary labels of hypotheses on examples x X, where the number of data points |X| is varied from 5 to 30. Given a learning instance, we consider setting (the underlying hypothesis) h to be every h H, and thus average the query complexity across random instances as well as across H. This is done to explore the average-case query complexity, where we do not focus on the query complexity of one particular h = h H (as was done in some of the worst-case analyses). We investigate two possible labeling strategies, with varying amounts of abstention p = 0.0, 0.15, 0.3, 0.45, 0.6. The first strategy is that given the underlying hypothesis h H, it abstains on labeling a point x with probability p, and outputs h (x) otherwise (w.p. 1 p). This labeling strategy may be viewed as one that abstains arbitrarily, and may compromise identifiability. This models the labeling strategy of a myopic labeler. The second strategy is a more careful, adaptive labeling strategy that always ensures identifiability. Given the underlying h , when x is queried, it computes the resultant E-VS if x was abstained upon. If abstention leads to non-identifiability, it labels x and returns h (x). Otherwise, it abstains with probability p and provides the label otherwise. This may be viewed as a more shrewd labeling strategy that always ensures identifiability, while using some abstention. Results: We plot results in Figure 3 and Figure 4, with Figure 3 corresponding to the first (random labeling) strategy and Figure 4 corresponding to the identifiable labeling strategy. |
| Researcher Affiliation | Academia | Tom Yan Machine Learning Department Carnegie Mellon University EMAIL Chicheng Zhang Department of Computer Science University of Arizona EMAIL |
| Pseudocode | Yes | Algorithm 2 E-VS Bisection Algorithm and Algorithm 3 Bisection Point Search Sub-routine are present on page 4. |
| Open Source Code | No | No explicit statement or link indicating the release of open-source code for the methodology described in the paper was found. |
| Open Datasets | No | The experiments use randomly generated learning instances and binary labels, not a publicly available or open dataset. |
| Dataset Splits | No | The paper describes experiments on randomly generated instances, but does not provide specific train/validation/test splits, cross-validation details, or predefined splits. |
| Hardware Specification | No | The paper describes the experiment setup (e.g., number of instances, hypothesis class sizes) but does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) used in the experiments. |
| Experiment Setup | Yes | Experiment Setup: We consider five sizes for the hypothesis class ranging from 15 to 40. Given a particular hypothesis class size |H|, we generate 50 random learning instances by randomly generating the binary labels of hypotheses on examples x X, where the number of data points |X| is varied from 5 to 30. Given a learning instance, we consider setting (the underlying hypothesis) h to be every h H, and thus average the query complexity across random instances as well as across H. This is done to explore the average-case query complexity, where we do not focus on the query complexity of one particular h = h H (as was done in some of the worst-case analyses). We investigate two possible labeling strategies, with varying amounts of abstention p = 0.0, 0.15, 0.3, 0.45, 0.6. |