Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Statistical Analysis of an Adversarial Bayesian Weak Supervision Method

Authors: Steven An

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments compare our proposed method against twelve baseline label models over eleven datasets. BBF compares favorably to other Bayesian label models and label models that don't use datapoint features, matching or exceeding their performance on eight out of eleven datasets.
Researcher Affiliation	Academia	Steven An Computer Science Department University of California, San Diego La Jolla, CA 92037 EMAIL
Pseudocode	No	The paper describes 'BBF Generative Process' and 'One-Coin iBCC Generative Process' with numbered steps, but these are descriptions of processes and not explicitly labeled or formatted as pseudocode or algorithm blocks.
Open Source Code	Yes	One may find the code in the supplementary materials or at https://github.com/stevenan5/bayesian-bf-neurips-2025.
Open Datasets	Yes	We used eleven datasets from WRENCH, which also provided the LF predictions. The following datasets had licenses we could find online and were different from WRENCH's license: YouTube, SMS (CC BY 4.0), SemEval (CC BY 3.0). For FABLE, Denoise, WeaSEL, we use the original datapoint features, unless they were textual, in which case RoBERTa [Liu et al., 2019] was used to extract them. The methods are evaluated in a transductive setting. The provided train/validation/test ... [Table 1: Some dataset statistics. Note that we only count datapoints with at least one LF prediction. Datasets: IMDB, YouTube, SMS, CDR, Yelp, Commercial, Tennis, TREC, SemEval, ChemProt, AG News.]
Dataset Splits	Yes	The provided train/validation/test splits are combined and points with no LF predictions are removed. Each label model is given LF predictions on the datapoints (and the datapoint features themselves if used).
Hardware Specification	Yes	Experiments were run in Python 3.6.13 (PSF) [Van Rossum and Drake, 2009] mainly using NumPy 1.19.5 (modified BSD) [Harris et al., 2020] with an AMD Ryzen R9 5950x, 128GB RAM, and an Nvidia RTX 2080Ti.
Software Dependencies	Yes	Experiments were run in Python 3.6.13 (PSF) [Van Rossum and Drake, 2009] mainly using NumPy 1.19.5 (modified BSD) [Harris et al., 2020] with an AMD Ryzen R9 5950x, 128GB RAM, and an Nvidia RTX 2080Ti. One may find the code in the supplementary materials or at https://github.com/stevenan5/bayesian-bf-neurips-2025. To compute the BBF prediction, we optimize Equation 5 using CVXPY 1.4.1 (Apache 2.0) [Diamond and Boyd, 2016] by way of MOSEK 10.1.16 (personal academic license) [MOSEK ApS, 2022] with default parameters.
Experiment Setup	Yes	For our method BBF, we set the prior hyperparameters as α = 1_K, ρ_w = (4, 1) for each w ∈ [W]. This follows the initialization of [Li et al., 2019] for their Bayesian method, i.e. we assume each LF makes 4 correct predictions and 1 wrong prediction. Note that there are no requirements on the LF accuracies for BBF. Across all LFs used in the experiments, the accuracies range from 0.0366 to 1. To compute the BBF prediction, we optimize Equation 5 using CVXPY 1.4.1 (Apache 2.0) [Diamond and Boyd, 2016] by way of MOSEK 10.1.16 (personal academic license) [MOSEK ApS, 2022] with default parameters. For all other methods that require initialization, we use the defaults provided in WRENCH.
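
Two of the steps quoted in the table can be illustrated with a short sketch: combining the train/validation/test splits while dropping datapoints with no LF predictions (Dataset Splits), and reading the prior (4, 1) as "4 correct predictions and 1 wrong prediction" (Experiment Setup). This is not the author's code. The function names, the use of -1 as the abstain value, and the interpretation of the pair as Beta pseudo-counts are all assumptions made for illustration; the repository linked above is the authoritative implementation.

```python
# Illustrative sketch only; names and conventions below are assumptions,
# not taken from the paper's repository.

ABSTAIN = -1  # assumed code for "this LF made no prediction on this datapoint"


def combine_and_filter(lf_splits, abstain=ABSTAIN):
    """Merge per-split LF prediction rows and drop uncovered datapoints.

    lf_splits: iterable of splits, each a list of rows; a row holds one
    prediction per labeling function (LF).
    Returns the rows that contain at least one non-abstain prediction.
    """
    combined = [row for split in lf_splits for row in split]
    return [row for row in combined if any(v != abstain for v in row)]


def prior_mean_accuracy(rho=(4.0, 1.0)):
    """Mean of a Beta(rho[0], rho[1]) distribution.

    Assumes rho is read as (correct, wrong) pseudo-counts, matching the
    quoted description of 4 correct predictions and 1 wrong prediction.
    """
    correct, wrong = rho
    return correct / (correct + wrong)


if __name__ == "__main__":
    train = [[0, -1, 1], [-1, -1, -1]]
    valid = [[1, 1, -1]]
    test = [[-1, -1, -1], [-1, 0, -1]]
    kept = combine_and_filter([train, valid, test])
    print(len(kept))              # 3 of the 5 rows have at least one LF vote
    print(prior_mean_accuracy())  # 0.8
```

Under these assumptions, the prior (4, 1) corresponds to an expected LF accuracy of 0.8 before any data is seen, and the filtering step matches the note in Table 1 that only datapoints with at least one LF prediction are counted.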