Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles

Authors: Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, Somesh Jha

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on 59 tasks over five dataset categories including image classification and sentiment classification datasets show that our method achieves state-of-the-art on both accuracy estimation and error detection (Section 7)."
Researcher Affiliation | Collaboration | Jiefeng Chen (University of Wisconsin-Madison, jiefeng@cs.wisc.edu); Frederick Liu (Google, frederickliu@google.com); Besim Avci (Google, besim@google.com); Xi Wu (Google, wuxi@google.com); Yingyu Liang (University of Wisconsin-Madison, yliang@cs.wisc.edu); Somesh Jha (University of Wisconsin-Madison, jha@cs.wisc.edu)
Pseudocode | Yes | Framework 1 "Error Detection and Unsupervised Accuracy Estimation via Self-Training Ensembles" (page 3) and Algorithms 1-3 (pages 5-6) provide structured pseudocode (see the illustrative sketch after this table).
Open Source Code | Yes | "Our code is available at: https://github.com/jfc43/self-training-ensembles."
Open Datasets | Yes | "We use the following dataset categories: Digits (including MNIST [26], MNIST-M [12], SVHN [29], USPS [19]), Office-31 [33], CIFAR10-C [24], iWildCam [1] and Amazon Review [2]."
Dataset Splits | Yes | "For all image datasets, we use a random split of 80% of the training data for training and 20% for validation." (see the split sketch after this table)
Hardware Specification | Yes | "For training, we use NVIDIA GPUs (e.g., V100 or A100)."
Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers, such as specific Python, PyTorch, or TensorFlow versions.
Experiment Setup | Yes | "We train all models for 100 epochs with the Adam optimizer, initial learning rate 1e-3, learning rate decay by 0.5 every 20 epochs, and batch size 64. ... In our experiments, we set T = 5 and N = 5 by considering the computational cost (on Amazon Review, we set N = 20). We set γ = 0.1 and set α following the domain adaptation methods." (see the training-setup sketch after this table)
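The exact Framework 1 and Algorithms 1-3 are given in the paper; the sketch below only illustrates the general self-training-ensemble idea at a high level, namely training several check models on target data pseudo-labeled by the model under evaluation and using ensemble disagreement to estimate accuracy and flag suspected errors. The function name f_predict, the bootstrap resampling, and the MLPClassifier check model are all illustrative assumptions, not the authors' procedure.

```python
# Hypothetical sketch of a disagreement-based self-training ensemble check.
# NOT the authors' exact Framework 1; a simplified illustration only.
import numpy as np
from sklearn.neural_network import MLPClassifier

def estimate_accuracy_and_errors(f_predict, x_target, n_models=5, seed=0):
    """f_predict: callable mapping inputs to predicted integer labels
    (the fixed model under evaluation).
    x_target: unlabeled target inputs, shape (num_samples, num_features)."""
    rng = np.random.RandomState(seed)
    pseudo_labels = f_predict(x_target)       # pseudo-label the target data with f
    ensemble_preds = []
    for i in range(n_models):
        # Train each check model on a bootstrap sample of the pseudo-labeled data
        idx = rng.choice(len(x_target), size=len(x_target), replace=True)
        check = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                              random_state=seed + i)
        check.fit(x_target[idx], pseudo_labels[idx])
        ensemble_preds.append(check.predict(x_target))
    ensemble_preds = np.stack(ensemble_preds)  # (n_models, num_samples)
    # Majority vote of the check ensemble on each target point
    vote = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, ensemble_preds)
    agree = vote == pseudo_labels
    estimated_accuracy = agree.mean()          # agreement rate as the accuracy estimate
    suspected_errors = np.where(~agree)[0]     # disagreements flagged as likely errors
    return estimated_accuracy, suspected_errors
```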
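For the Dataset Splits row, a minimal sketch of the reported 80%/20% random split of the training data, assuming PyTorch/torchvision; the MNIST dataset choice and the fixed seed are illustrative assumptions, not taken from the authors' released code.

```python
# Hypothetical sketch of an 80/20 random train/validation split.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

train_full = datasets.MNIST(root="./data", train=True, download=True,
                            transform=transforms.ToTensor())
n_train = int(0.8 * len(train_full))            # 80% of the training data for training
n_val = len(train_full) - n_train               # remaining 20% for validation
train_set, val_set = random_split(
    train_full, [n_train, n_val],
    generator=torch.Generator().manual_seed(0)  # fixed seed for a reproducible split
)
```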
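For the Experiment Setup row, a minimal sketch matching the reported training hyperparameters (Adam, initial learning rate 1e-3, learning rate halved every 20 epochs, batch size 64, 100 epochs), assuming PyTorch. The small MLP and the random TensorDataset are placeholders, not the authors' architectures or data; StepLR's gamma=0.5 is the decay factor and is unrelated to the paper's threshold γ = 0.1.

```python
# Hypothetical training-setup sketch with the reported hyperparameters.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data, only to make the sketch self-contained.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))
data = TensorDataset(torch.randn(640, 1, 28, 28), torch.randint(0, 10, (640,)))
loader = DataLoader(data, batch_size=64, shuffle=True)        # batch size 64

optimizer = optim.Adam(model.parameters(), lr=1e-3)           # Adam, initial lr 1e-3
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                                      # 100 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                                          # halve lr every 20 epochs
```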