Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles
Authors: Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, Somesh Jha
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on 59 tasks over five dataset categories including image classification and sentiment classification datasets show that our method achieves state-of-the-art on both accuracy estimation and error detection (Section 7). |
| Researcher Affiliation | Collaboration | Jiefeng Chen, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706 (jiefeng@cs.wisc.edu); Frederick Liu, Google, Seattle, WA 98103 (frederickliu@google.com); Besim Avci, Google, Seattle, WA 98103 (besim@google.com); Xi Wu, Google, Madison, WI 53703 (wuxi@google.com); Yingyu Liang, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706 (yliang@cs.wisc.edu); Somesh Jha, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706 (jha@cs.wisc.edu) |
| Pseudocode | Yes | Framework 1 (Error Detection and Unsupervised Accuracy Estimation via Self-Training Ensembles, page 3) and Algorithms 1-3 (pages 5-6) provide structured pseudocode (a hedged code sketch of this framework follows the table). |
| Open Source Code | Yes | Our code is available at: https://github.com/jfc43/self-training-ensembles. |
| Open Datasets | Yes | We use the following dataset categories: Digits (including MNIST [26], MNIST-M [12], SVHN [29], USPS [19]), Office-31 [33], CIFAR10-C [24], iWildCam [1] and Amazon Review [2]. |
| Dataset Splits | Yes | For all image datasets, we use a random split of 80% of the training data for training and 20% for validation (see the split sketch after the table). |
| Hardware Specification | Yes | For training, we use NVIDIA GPUs (e.g., V100 or A100). |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers, such as specific Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | We train all models for 100 epochs with Adam optimizer, initial learning rate 1e-3, learning rate decay by 0.5 every 20 epochs, and batch size 64. ... In our experiments, we set T = 5 and N = 5 by considering the computational cost (on Amazon Review, we set N = 20). We set γ = 0.1 and set α following the domain adaptation methods. (A configuration sketch follows the table.) |
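
The Pseudocode row above points to the paper's Framework 1 and Algorithms 1-3. The snippet below is a minimal, hedged sketch of the general self-training-ensemble recipe (train check models on source data plus pseudo-labeled target data, then use ensemble disagreement with the fixed model both to flag likely errors and to estimate accuracy). It is an illustration only, not the authors' exact algorithms: the scikit-learn base model, the bootstrap and pseudo-labeling choices, and all hyperparameters here are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def self_training_ensemble_estimate(f, X_source, y_source, X_target,
                                    n_models=5, n_rounds=5, seed=0):
    """Estimate f's accuracy on unlabeled X_target and flag likely errors.

    f is the fixed model under evaluation (must expose .predict); labels are
    assumed to be integers 0..K-1. Returns (estimated_accuracy, error_mask),
    where error_mask[i] is True when the ensemble disagrees with f on point i.
    """
    rng = np.random.RandomState(seed)
    f_pred = f.predict(X_target)

    # Initialize target pseudo-labels from the model under evaluation.
    pseudo = f_pred.copy()
    base = LogisticRegression(max_iter=1000)

    for _ in range(n_rounds):
        votes = []
        for _ in range(n_models):
            # Each check model is trained on a bootstrap of the labeled source
            # data together with the pseudo-labeled target data (self-training).
            idx = rng.choice(len(X_source), size=len(X_source), replace=True)
            X_train = np.vstack([X_source[idx], X_target])
            y_train = np.concatenate([y_source[idx], pseudo])
            votes.append(clone(base).fit(X_train, y_train).predict(X_target))
        votes = np.stack(votes)  # shape: (n_models, n_target)
        # The ensemble's majority vote becomes the next round's pseudo-labels.
        pseudo = np.array([np.bincount(col).argmax() for col in votes.T])

    # Disagreement with f marks likely errors; agreement rate estimates accuracy.
    error_mask = pseudo != f_pred
    return 1.0 - error_mask.mean(), error_mask
```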
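
For the Dataset Splits row, the following is a minimal sketch of the quoted 80%/20% train/validation random split, assuming a PyTorch data pipeline; the tensor dataset here is a placeholder, not one of the paper's datasets.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset standing in for one of the image training sets.
full_train = TensorDataset(torch.randn(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))

# 80% of the training data for training, 20% for validation (random split).
n_train = int(0.8 * len(full_train))
train_set, val_set = random_split(full_train, [n_train, len(full_train) - n_train])
```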
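
For the Experiment Setup row, this is a hedged sketch of the quoted optimizer schedule (Adam, initial learning rate 1e-3, decayed by 0.5 every 20 epochs, batch size 64, 100 epochs), again in PyTorch; the model, data, and training loop below are placeholders, not taken from the authors' code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

# Placeholder data; the paper trains on the image/text datasets listed above.
loader = DataLoader(
    TensorDataset(torch.randn(512, 1, 28, 28), torch.randint(0, 10, (512,))),
    batch_size=64, shuffle=True,
)

for epoch in range(100):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate by 0.5 every 20 epochs
```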