Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Authors: Harit Vishwakarma, Yi Chen, Sui Jiet Tay, Satya Sai Srinath Namburi, Frederic Sala, Ramya Korlakai Vinayak

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We perform an extensive empirical evaluation of Colander and compare it against methods designed for calibration. Colander achieves up to 60% improvement on coverage over the baselines while maintaining error level below 5% and using the same amount of labeled data.
Researcher Affiliation Collaboration Harit Vishwakarma (hvishwakarma@cs.wisc.edu), University of Wisconsin-Madison; Yi Chen (yi.chen@wisc.edu), University of Wisconsin-Madison; Sui Jiet Tay (st5494@nyu.edu), NYU Courant Institute; Satya Sai Srinath Namburi (satya.namburi@gehealthcare.com), GE HealthCare; Frederic Sala (fredsala@cs.wisc.edu), University of Wisconsin-Madison; Ramya Korlakai Vinayak (ramya@ece.wisc.edu), University of Wisconsin-Madison
Pseudocode Yes See Algorithms 1, 2 and 3.
Open Source Code Yes Our code, with instructions to run, is uploaded along with the paper. https://github.com/harit7/TBAL-Colander-NeurIPS-24
Open Datasets Yes MNIST [30] is a hand-written digits dataset. We use the LeNet [31] for auto-labeling. CIFAR-10 [24] is an image dataset with 10 classes. We use a CNN with approximately 5.8M parameters [20] for auto-labeling. Tiny-ImageNet [29] is an image dataset comprising 100K images across 200 classes. We use CLIP [43] to derive embeddings for the images in the dataset and use an MLP model. 20 Newsgroups [34] is a natural language dataset comprising around 18K news posts across 20 topics. We use the FlagEmbedding [58] to obtain text embeddings and use an MLP model.
Dataset Splits Yes Let D be some finite number of labeled samples, and then the empirical coverage and auto-labeling error are defined as follows, ... We randomly split the validation data into two parts D_cal and D_th and use D_cal to compute P̂(g, t | h, D_cal) and Ê(g, t | h, D_cal). ... We first randomly split the validation data D_val^(i) into D_cal^(i) and D_th^(i) using procedure RANDOMSPLIT(D_val^(i), ν).
Hardware Specification Yes Our experiments were conducted on machines equipped with the NVIDIA RTX A6000 and NVIDIA GeForce RTX 4090 GPUs.
Software Dependencies No No specific version numbers for software dependencies are provided. For example, the paper mentions 'pytorch' without specifying a version.
Experiment Setup Yes We train the model to zero training error using minibatch SGD with learning rate 1e-3, weight decay 1e-3 [15, 25], momentum 0.9, and batch size 32. ... The hyperparameters and their values we swept over are listed in Tables 9 and 10 for train-time and post-hoc methods, respectively.
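The split-and-evaluate procedure quoted in the Dataset Splits row can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the paper's code: it assumes coverage is the fraction of points confident enough to auto-label and error is the fraction of auto-labeled points whose label disagrees with the ground truth. Function names such as random_split and coverage_and_error are ours.

```python
import random

def random_split(d_val, nu, seed=0):
    """Split validation data into D_cal and D_th; nu is the fraction assigned to D_cal."""
    rng = random.Random(seed)
    items = list(d_val)
    rng.shuffle(items)
    k = int(nu * len(items))
    return items[:k], items[k:]

def coverage_and_error(predictions, truths, confidences, threshold):
    """Empirical coverage: fraction of points auto-labeled (confidence >= threshold).
    Auto-labeling error: fraction of auto-labeled points that are mislabeled."""
    auto = [(p, y) for p, y, c in zip(predictions, truths, confidences)
            if c >= threshold]
    cov = len(auto) / len(predictions)
    err = sum(p != y for p, y in auto) / max(len(auto), 1)
    return cov, err
```

With a confidence threshold of 0.7, points below the threshold are abstained on, which trades coverage against the auto-labeling error that the paper constrains to stay below 5%.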
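The training recipe in the Experiment Setup row (minibatch SGD, learning rate 1e-3, weight decay 1e-3, momentum 0.9) can be sketched as a single parameter update in plain Python. This is a minimal illustration of the stated hyperparameters, assuming the common convention that weight decay is applied as an L2 term added to the gradient; it is not the paper's training loop.

```python
def sgd_step(params, grads, velocity, lr=1e-3, weight_decay=1e-3, momentum=0.9):
    """One SGD-with-momentum update over flat lists of scalar parameters."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p       # L2 weight decay folded into the gradient
        v = momentum * v + g           # momentum accumulation
        new_params.append(p - lr * v)  # parameter update
        new_velocity.append(v)
    return new_params, new_velocity
```

In practice this corresponds to configuring the framework's SGD optimizer with the quoted learning rate, weight decay, and momentum, and iterating over minibatches of size 32 until training error reaches zero.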