Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On Truthing Issues in Supervised Classification
Authors: Jonathan K. Su
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the effectiveness of the methods and confirm the implication. We conducted a number of experiments to see how the different testing and training methods performed and to check the implication of equivalent mutual information for different combinations of labelers. |
| Researcher Affiliation | Academia | Jonathan K. Su EMAIL MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02421-6426, USA |
| Pseudocode | Yes | Algorithm 1: MMSE testing with empirical Bayes estimation of $(p_D, p_{FA})$ via ratios of jointly normal RVs. Algorithm 2: MMSE testing with empirical Bayes estimation of $(p_D, p_{FA})$ via sampling. Algorithm 3: Suboptimal estimation of $(p_D, p_{FA})$ by estimating the correct-label RVs $Y$. Algorithm 4: MMSE testing for multi-class classification with empirical Bayes estimation of $K$ via sampling. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It provides only a license and attribution requirements for the paper itself, not for any accompanying code. |
| Open Datasets | Yes | We use the Ionosphere binary-classification data set from the UCI Machine Learning Repository (see Dua and Graff, 2017) |
| Dataset Splits | Yes | We employ 75%/25% stratified hold-out validation since multi-fold cross-validation produced cluttered plots that were too difficult to read. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. It describes the simulation settings and algorithms but gives no information about the computational resources. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers needed to replicate the experiments. It mentions using L2 regularization and the Broyden-Fletcher-Goldfarb-Shanno method for training but names no specific software libraries or versions. |
| Experiment Setup | Yes | The settings were $\delta_i \sim \mathrm{Beta}(1, 5)$, $\forall i$; $\phi_t \sim U(0, 0.5)$, $\forall t$; $\eta_1 = 1$ to force the first labeler to label every sample; and $\eta_t \sim U(0.33, 1)$, $\forall t \in T \setminus \{1\}$. ... For each training method, the regularization weight $\lambda$ was swept over $\{0.5, 1.0, \ldots, 10.0\}$, producing twenty trained models. ... This section presents results for the single default threshold $\tau = 1/2$. |
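The Dataset Splits row reports a 75%/25% stratified hold-out split. Since the paper publishes no code, the following standard-library Python sketch is only an illustration of how such a split can be drawn. The function name `stratified_holdout`, the fixed seed, and the per-class rounding rule are assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_frac=0.25, seed=0):
    """Hypothetical helper: split sample indices into train/test parts so that
    each class keeps roughly the same proportion in both parts."""
    rng = random.Random(seed)
    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        # Assumption: per-class test size is rounded to the nearest integer
        n_test = round(test_frac * len(idxs))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

labels = [0] * 80 + [1] * 20
train, test = stratified_holdout(labels, test_frac=0.25, seed=1)
print(len(train), len(test))
```

With 80 samples of class 0 and 20 of class 1, the test part receives 20 and 5 of them respectively, so both parts keep the 4:1 class ratio.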
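The Experiment Setup row quotes the simulated-labeler settings and the regularization sweep. The sketch below, a hypothetical illustration rather than the author's code, draws parameters from the quoted distributions with Python's `random` module and builds the twenty-value $\lambda$ grid. The roles of $\delta_i$ and $\phi_t$ are not described in this summary, so they are only sampled. Treating $\eta_t$ as the probability that labeler $t$ labels a given sample is an assumption inferred from the note that $\eta_1 = 1$ forces the first labeler to label every sample. The names `sample_simulation_params` and `LAMBDAS` are invented for illustration, and labeler 1 corresponds to index 0.

```python
import random

def sample_simulation_params(n_samples, n_labelers, seed=0):
    """Hypothetical helper: draw the simulated-labeler settings quoted above."""
    rng = random.Random(seed)
    # delta_i ~ Beta(1, 5), one value per sample i
    delta = [rng.betavariate(1, 5) for _ in range(n_samples)]
    # phi_t ~ U(0, 0.5), one value per labeler t
    phi = [rng.uniform(0, 0.5) for _ in range(n_labelers)]
    # eta_1 = 1 for the first labeler (index 0); eta_t ~ U(0.33, 1) for the rest
    eta = [1.0] + [rng.uniform(0.33, 1) for _ in range(n_labelers - 1)]
    # Assumption: eta_t is the probability that labeler t labels a given sample.
    # rng.random() lies in [0, 1), so eta = 1.0 always yields True.
    labeled = [[rng.random() < eta[t] for _ in range(n_samples)]
               for t in range(n_labelers)]
    return delta, phi, eta, labeled

# Regularization grid {0.5, 1.0, ..., 10.0}: twenty values, one model each
LAMBDAS = [0.5 * k for k in range(1, 21)]
```

Seeding a private `random.Random` instance keeps the draws reproducible without touching the module-level generator.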