Weak Supervision Performance Evaluation via Partial Identification
Authors: Felipe Maia Polo, Subha Maity, Mikhail Yurochkin, Moulinath Banerjee, Yuekai Sun
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications. (An illustrative bounds sketch follows the table.) |
| Researcher Affiliation | Collaboration | Felipe Maia Polo, Department of Statistics, University of Michigan; Subha Maity, Department of Statistics and Actuarial Science, University of Waterloo; Mikhail Yurochkin, MIT-IBM Watson AI Lab; Moulinath Banerjee, Department of Statistics, University of Michigan; Yuekai Sun, Department of Statistics, University of Michigan |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code can be found at https://github.com/felipemaiapolo/wsbounds |
| Open Datasets | Yes | Wrench datasets: To carry out realistic experiments within the weak supervision setup and study accuracy/F1 score estimation, we utilize datasets incorporated in Wrench (Weak Supervision Benchmark) [63]. This standardized benchmark platform features real-world datasets and pre-generated weak labels for evaluating weak supervision methodologies. Most of Wrench's datasets are designed for classification tasks, encompassing diverse data types such as tabular, text, and image; all contain their pre-computed weak labels. Specifically, we utilize Census [27], YouTube [1], SMS [2], IMDB [37], Yelp [65], AGNews [65], TREC [31], Spouse [12], SemEval [24], CDR [14], ChemProt [29], Commercial [22], Tennis Rally [22], Basketball [22]. For text datasets, we employ the paraphrase-MiniLM-L6-v2 model from the sentence-transformers library for feature extraction [51]. Features were extracted for the image datasets before their inclusion in Wrench. Hate Speech Dataset [15]: This dataset contains sentence-level annotations for hate speech in English, sourced from posts from white supremacy forums. (A feature-extraction sketch follows the table.) |
| Dataset Splits | No | All experiments are structured to emulate conditions where high-quality labels are inaccessible during training, validation, and testing phases... To fit the label models, we assume P_Y is known (computed using the training set). |
| Hardware Specification | No | All experiments were conducted using a virtual machine with 32 cores. |
| Software Dependencies | No | Unless stated, we use l2-regularized logistic regressors as classifiers... employ the paraphrase-MiniLM-L6-v2 model from the sentence-transformers library for feature extraction [51]... Snorkel's [48, 47] default label model... Adam [26]... The paper mentions libraries and tools like `sentence-transformers`, `Snorkel`, and `Adam`, but generally without specific version numbers for these software components. |
| Experiment Setup | Yes | Unless stated, we use l2-regularized logistic regressors as classifiers, where the regularization strength is determined according to the validation noise-aware loss... The considered MLPs have one hidden layer with a possible number of neurons in {50, 100}. Training is carried out with Adam [26], with possible learning rates in {.1, .001} and weight decay (l2 regularization parameter) in {.1, .001}. For those datasets that use the F1 score as the evaluation metric, we also tune the classification threshold in {.2, .4, .5, .6, .8} (otherwise, they return the most probable class as a prediction). (A grid-search sketch follows the table.) |
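
To make the partial-identification idea in the abstract concrete, here is a minimal sketch; it is **not** the authors' estimator, which additionally exploits weak labels and features through scalable convex programs. With only the marginal rate of positive predictions and the class prior P(Y=1), the accuracy of a binary classifier is bounded by a small linear program over the joint distribution of (f(X), Y):

```python
import numpy as np
from scipy.optimize import linprog

def accuracy_bounds(p_pred1, p_y1):
    """Toy partial-identification bounds on accuracy for a binary classifier,
    given only the marginal P(f(X)=1) and the class prior P(Y=1).

    Variables: joint probabilities [pi_00, pi_01, pi_10, pi_11],
    where pi_ab = P(f(X)=a, Y=b). Accuracy = pi_00 + pi_11.
    """
    A_eq = np.array([
        [1, 1, 1, 1],   # total probability mass sums to 1
        [0, 0, 1, 1],   # P(f(X)=1) = p_pred1
        [0, 1, 0, 1],   # P(Y=1)    = p_y1
    ])
    b_eq = np.array([1.0, p_pred1, p_y1])
    acc = np.array([1.0, 0.0, 0.0, 1.0])  # objective: pi_00 + pi_11

    lo = linprog(acc,  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    hi = linprog(-acc, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    return lo.fun, -hi.fun

print(accuracy_bounds(0.6, 0.5))  # -> approximately (0.1, 0.9)
```

The width of this interval is exactly what informative weak labels shrink in the paper: each weak label adds constraints on the feasible joints, tightening the metric bounds.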
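
For the text datasets, the quoted setup extracts features with the paraphrase-MiniLM-L6-v2 sentence encoder. A minimal sketch of that step with the sentence-transformers library (the example texts are made up):

```python
from sentence_transformers import SentenceTransformer

# paraphrase-MiniLM-L6-v2 is the feature extractor named in the paper.
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

texts = ["check out my channel!", "great song"]  # e.g. YouTube comments
features = model.encode(texts)  # numpy array of shape (2, 384)
```

These fixed 384-dimensional embeddings then serve as inputs to the downstream classifiers described in the experiment setup.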
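
The hyperparameter grids quoted in the setup are small enough to enumerate directly. The sketch below reproduces those grids and the F1 threshold search; note that the paper selects hyperparameters via a validation noise-aware loss (clean labels are unavailable), so the `f1_score` call on labels here is purely illustrative:

```python
import numpy as np
from itertools import product
from sklearn.metrics import f1_score

# Grids quoted in the setup (MLP variant).
mlp_grid = {
    "hidden_neurons": [50, 100],
    "lr": [0.1, 0.001],
    "weight_decay": [0.1, 0.001],
}
configs = [dict(zip(mlp_grid, vals)) for vals in product(*mlp_grid.values())]

def tune_threshold(val_probs, val_labels, thresholds=(0.2, 0.4, 0.5, 0.6, 0.8)):
    """Pick the F1-maximizing classification threshold from the quoted grid.
    Illustrative only: the paper tunes without access to clean labels."""
    scores = [f1_score(val_labels, (val_probs >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]
```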