Pearls from Pebbles: Improved Confidence Functions for Auto-labeling
Authors: Harit Vishwakarma, Yi Chen, Sui Jiet Tay, Satya Sai Srinath Namburi, Frederic Sala, Ramya Korlakai Vinayak
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an extensive empirical evaluation of Colander and compare it against methods designed for calibration. Colander achieves up to 60% improvement in coverage over the baselines while maintaining an error level below 5% and using the same amount of labeled data. |
| Researcher Affiliation | Collaboration | Harit Vishwakarma (hvishwakarma@cs.wisc.edu), University of Wisconsin-Madison; Yi Chen (yi.chen@wisc.edu), University of Wisconsin-Madison; Sui Jiet Tay (st5494@nyu.edu), NYU Courant Institute; Satya Sai Srinath Namburi (satya.namburi@gehealthcare.com), GE HealthCare; Frederic Sala (fredsala@cs.wisc.edu), University of Wisconsin-Madison; Ramya Korlakai Vinayak (ramya@ece.wisc.edu), University of Wisconsin-Madison |
| Pseudocode | Yes | See Algorithms 1, 2 and 3. |
| Open Source Code | Yes | Our code, with instructions to run, is uploaded along with the paper. https://github.com/harit7/TBAL-Colander-NeurIPS-24 |
| Open Datasets | Yes | MNIST [30] is a hand-written digits dataset. We use LeNet [31] for auto-labeling. CIFAR-10 [24] is an image dataset with 10 classes. We use a CNN with approximately 5.8M parameters [20] for auto-labeling. Tiny-ImageNet [29] is an image dataset comprising 100K images across 200 classes. We use CLIP [43] to derive embeddings for the images in the dataset and use an MLP model. 20 Newsgroups [34] is a natural language dataset comprising around 18K news posts across 20 topics. We use FlagEmbedding [58] to obtain text embeddings and use an MLP model. (See the embedding-pipeline sketch after this table.) |
| Dataset Splits | Yes | Let $D$ be some finite set of labeled samples; the empirical coverage and auto-labeling error are then defined as follows, ... We randomly split the validation data into two parts $D_{\text{cal}}$ and $D_{\text{th}}$ and use $D_{\text{cal}}$ to compute $\hat{P}(g, t \mid h, D_{\text{cal}})$ and $\hat{E}(g, t \mid h, D_{\text{cal}})$. ... We first randomly split the validation data $D^{(i)}_{\text{val}}$ into $D^{(i)}_{\text{cal}}$ and $D^{(i)}_{\text{th}}$ using the procedure RANDOMSPLIT$(D^{(i)}_{\text{val}}, \nu)$. (See the split-and-estimate sketch after this table.) |
| Hardware Specification | Yes | Our experiments were conducted on machines equipped with NVIDIA RTX A6000 and NVIDIA GeForce RTX 4090 GPUs. |
| Software Dependencies | No | No specific version numbers for software dependencies are provided; for example, the paper mentions 'pytorch' without a version. |
| Experiment Setup | Yes | We train the model to zero training error using minibatch SGD with learning rate 1e-3, weight decay 1e-3 [15, 25], momentum 0.9, and batch size 32. ... The hyperparameters and their values we swept over are listed in Tables 9 and 10 for train-time and post-hoc methods, respectively. (See the training-loop sketch after this table.) |
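The Tiny-ImageNet and 20 Newsgroups rows describe a two-stage pipeline: a frozen embedding model followed by an MLP classifier. Below is a minimal sketch of the image half of that pipeline, assuming OpenAI's `clip` package and a ViT-B/32 backbone; the paper does not specify the CLIP variant or the MLP architecture, so both are illustrative placeholders.

```python
import clip  # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
from torch import nn
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP backbone; ViT-B/32 is an assumption, the paper only says "CLIP".
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(path: str) -> torch.Tensor:
    """Return a frozen CLIP embedding for a single image file."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        return clip_model.encode_image(image).float().squeeze(0)

# Placeholder MLP head over the embeddings; width and depth are illustrative,
# not the paper's reported architecture.
mlp = nn.Sequential(
    nn.Linear(512, 256),   # ViT-B/32 image embeddings are 512-dimensional
    nn.ReLU(),
    nn.Linear(256, 200),   # Tiny-ImageNet has 200 classes
).to(device)
```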
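The Dataset Splits row quotes two ingredients: a RANDOMSPLIT of the validation data controlled by a fraction $\nu$, and empirical estimates $\hat{P}$ (coverage) and $\hat{E}$ (auto-labeling error) computed on $D_{\text{cal}}$. A minimal NumPy sketch of both is below; the function names and signatures are illustrative assumptions, while the quantities follow the standard threshold-based auto-labeling definitions (coverage is the fraction of points with confidence $g(x) \ge t$, and error is the misclassification rate among those points).

```python
import numpy as np

def random_split(rng, data, nu):
    """Split validation data into D_cal (a nu fraction) and D_th (the rest).

    Mirrors the RANDOMSPLIT(D_val, nu) step quoted above; the exact
    signature here is an illustrative assumption.
    """
    idx = rng.permutation(len(data))
    k = int(nu * len(data))
    return [data[i] for i in idx[:k]], [data[i] for i in idx[k:]]

def empirical_coverage_and_error(scores, preds, labels, t):
    """Empirical coverage P-hat(g, t | h, D) and auto-labeling error E-hat(g, t | h, D).

    scores: confidence g(x) for each point in D
    preds:  classifier predictions h(x)
    labels: ground-truth labels y
    t:      confidence threshold; points with g(x) >= t are auto-labeled
    """
    scores, preds, labels = map(np.asarray, (scores, preds, labels))
    auto = scores >= t
    coverage = float(auto.mean())          # fraction of points auto-labeled
    if not auto.any():                     # nothing crosses the threshold
        return coverage, 0.0
    error = float((preds[auto] != labels[auto]).mean())  # error among auto-labeled
    return coverage, error

# Example: split D_val with nu = 0.5, then estimate on D_cal.
rng = np.random.default_rng(0)
d_val = list(zip([0.9, 0.4, 0.8, 0.7], [1, 0, 1, 1], [1, 0, 0, 1]))
d_cal, d_th = random_split(rng, d_val, nu=0.5)
cov, err = empirical_coverage_and_error(*zip(*d_cal), t=0.6)
```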
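The Experiment Setup row fixes the optimizer settings (SGD, learning rate 1e-3, weight decay 1e-3, momentum 0.9, batch size 32) and the stopping condition (zero training error). A minimal PyTorch sketch under those settings follows; the model and data here are stand-ins, not the paper's architectures or datasets.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; only the optimizer and loader settings below are
# taken from the quoted setup.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
data = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            weight_decay=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Train until the model fits the training set ("zero training error"),
# as described in the quoted experiment setup.
for epoch in range(100):
    correct = 0
    for x, y in loader:
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        correct += (logits.argmax(dim=1) == y).sum().item()
    if correct == len(data):
        break
```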