Weak Supervision Performance Evaluation via Partial Identification

Authors: Felipe Maia Polo, Subha Maity, Mikhail Yurochkin, Moulinath Banerjee, Yuekai Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications.
Researcher Affiliation | Collaboration | Felipe Maia Polo (Department of Statistics, University of Michigan); Subha Maity (Department of Statistics and Actuarial Science, University of Waterloo); Mikhail Yurochkin (MIT-IBM Watson AI Lab); Moulinath Banerjee (Department of Statistics, University of Michigan); Yuekai Sun (Department of Statistics, University of Michigan)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code can be found on https://github.com/felipemaiapolo/wsbounds
Open Datasets | Yes | Wrench datasets: To carry out realistic experiments within the weak supervision setup and study accuracy/F1 score estimation, we utilize datasets incorporated in Wrench (Weak Supervision Benchmark) [63]. This standardized benchmark platform features real-world datasets and pre-generated weak labels for evaluating weak supervision methodologies. Most of Wrench's datasets are designed for classification tasks, encompassing diverse data types such as tabular, text, and image; all contain their pre-computed weak labels. Specifically, we utilize Census [27], YouTube [1], SMS [2], IMDB [37], Yelp [65], AGNews [65], TREC [31], Spouse [12], SemEval [24], CDR [14], ChemProt [29], Commercial [22], Tennis Rally [22], Basketball [22]. For text datasets, we employ the paraphrase-MiniLM-L6-v2 model from the sentence-transformers library for feature extraction [51]. Features were extracted for the image datasets before their inclusion in Wrench. Hate Speech Dataset [15]: This dataset contains sentence-level annotations for hate speech in English, sourced from posts from white supremacy forums.
Dataset Splits | No | All experiments are structured to emulate conditions where high-quality labels are inaccessible during training, validation, and testing phases... To fit the label models, we assume P_Y is known (computed using the training set).
Hardware Specification | No | All experiments were conducted using a virtual machine with 32 cores.
Software Dependencies | No | Unless stated, we use l2-regularized logistic regressors as classifiers... employ the paraphrase-MiniLM-L6-v2 model from the sentence-transformers library for feature extraction [51]... Snorkel's [48, 47] default label model... Adam [26]... The paper mentions libraries and tools such as `sentence-transformers`, `Snorkel`, and the Adam optimizer, but generally without specific version numbers for these software components.
Experiment Setup | Yes | Unless stated, we use l2-regularized logistic regressors as classifiers, where the regularization strength is determined according to the validation noise-aware loss... The considered MLPs have one hidden layer with a possible number of neurons in {50, 100}. Training is carried out with Adam [26], with possible learning rates in {.1, .001} and weight decay (l2 regularization parameter) in {.1, .001}. For those datasets that use the F1 score as the evaluation metric, we also tune the classification threshold in {.2, .4, .5, .6, .8} (otherwise, they return the most probable class as a prediction).
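The hyperparameter grid quoted above is small enough to enumerate exhaustively. A minimal sketch of its size, assuming a plain Cartesian-product search (variable names are illustrative, not taken from the paper's code):

```python
from itertools import product

# Hypothetical enumeration of the grid described in the experiment setup.
hidden_neurons = [50, 100]               # one hidden layer
learning_rates = [0.1, 0.001]            # Adam learning rates
weight_decays = [0.1, 0.001]             # l2 regularization strengths
thresholds = [0.2, 0.4, 0.5, 0.6, 0.8]   # tuned only for F1-score datasets

# Core MLP grid: every combination of width, learning rate, and weight decay.
mlp_grid = list(product(hidden_neurons, learning_rates, weight_decays))
print(len(mlp_grid))  # 2 * 2 * 2 = 8 configurations

# For F1-score datasets, the classification threshold multiplies the grid;
# other datasets skip this axis and predict the most probable class.
f1_grid = list(product(mlp_grid, thresholds))
print(len(f1_grid))   # 8 * 5 = 40 configurations
```

So model selection per dataset amounts to at most 40 candidate configurations, which is consistent with the paper's choice of small, fully enumerable grids rather than random search.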