Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles

Authors: Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, Somesh Jha

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on 59 tasks over five dataset categories including image classification and sentiment classification datasets show that our method achieves state-of-the-art on both accuracy estimation and error detection (Section 7)."
Researcher Affiliation | Collaboration | Jiefeng Chen (University of Wisconsin-Madison, jiefeng@cs.wisc.edu); Frederick Liu (Google, frederickliu@google.com); Besim Avci (Google, besim@google.com); Xi Wu (Google, wuxi@google.com); Yingyu Liang (University of Wisconsin-Madison, yliang@cs.wisc.edu); Somesh Jha (University of Wisconsin-Madison, jha@cs.wisc.edu)
Pseudocode | Yes | Framework 1 "Error Detection and Unsupervised Accuracy Estimation via Self-Training Ensembles" (page 3) and Algorithms 1-3 (pages 5-6) provide structured pseudocode (see the illustrative sketch after this table).
Open Source Code | Yes | "Our code is available at: https://github.com/jfc43/self-training-ensembles."
Open Datasets | Yes | "We use the following dataset categories: Digits (including MNIST [26], MNIST-M [12], SVHN [29], USPS [19]), Office-31 [33], CIFAR10-C [24], iWildCam [1] and Amazon Review [2]."
Dataset Splits | Yes | "For all image datasets, we use a random split of 80% of the training data for training and 20% for validation." (see the split sketch after this table)
Hardware Specification | Yes | "For training, we use NVIDIA GPUs (e.g., V100 or A100)."
Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers, such as specific Python, PyTorch, or TensorFlow versions.
Experiment Setup | Yes | "We train all models for 100 epochs with the Adam optimizer, initial learning rate 1e-3, learning rate decay by 0.5 every 20 epochs, and batch size 64. ... In our experiments, we set T = 5 and N = 5 by considering the computational cost (on Amazon Review, we set N = 20). We set γ = 0.1 and set α following the domain adaptation methods." (see the training-setup sketch after this table)
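The exact Framework 1 and Algorithms 1-3 are given in the paper; the sketch below only illustrates the general self-training-ensemble idea at a high level, namely training several check models on target data pseudo-labeled by the model under evaluation and using ensemble disagreement to estimate accuracy and flag suspected errors. The function name f_predict, the bootstrap resampling, and the MLPClassifier check model are all illustrative assumptions, not the authors' procedure.

```python
# Hypothetical sketch of a disagreement-based self-training ensemble check.
# NOT the authors' exact Framework 1; a simplified illustration only.
import numpy as np
from sklearn.neural_network import MLPClassifier

def estimate_accuracy_and_errors(f_predict, x_target, n_models=5, seed=0):
    """f_predict: callable mapping inputs to predicted integer labels
    (the fixed model under evaluation).
    x_target: unlabeled target inputs, shape (num_samples, num_features)."""
    rng = np.random.RandomState(seed)
    pseudo_labels = f_predict(x_target)       # pseudo-label the target data with f
    ensemble_preds = []
    for i in range(n_models):
        # Train each check model on a bootstrap sample of the pseudo-labeled data
        idx = rng.choice(len(x_target), size=len(x_target), replace=True)
        check = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                              random_state=seed + i)
        check.fit(x_target[idx], pseudo_labels[idx])
        ensemble_preds.append(check.predict(x_target))
    ensemble_preds = np.stack(ensemble_preds)  # (n_models, num_samples)
    # Majority vote of the check ensemble on each target point
    vote = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, ensemble_preds)
    agree = vote == pseudo_labels
    estimated_accuracy = agree.mean()          # agreement rate as the accuracy estimate
    suspected_errors = np.where(~agree)[0]     # disagreements flagged as likely errors
    return estimated_accuracy, suspected_errors
```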
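For the Dataset Splits row, a minimal sketch of the reported 80%/20% random split of the training data, assuming PyTorch/torchvision; the MNIST dataset choice and the fixed seed are illustrative assumptions, not taken from the authors' released code.

```python
# Hypothetical sketch of an 80/20 random train/validation split.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

train_full = datasets.MNIST(root="./data", train=True, download=True,
                            transform=transforms.ToTensor())
n_train = int(0.8 * len(train_full))            # 80% of the training data for training
n_val = len(train_full) - n_train               # remaining 20% for validation
train_set, val_set = random_split(
    train_full, [n_train, n_val],
    generator=torch.Generator().manual_seed(0)  # fixed seed for a reproducible split
)
```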
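For the Experiment Setup row, a minimal sketch matching the reported training hyperparameters (Adam, initial learning rate 1e-3, learning rate halved every 20 epochs, batch size 64, 100 epochs), assuming PyTorch. The small MLP and the random TensorDataset are placeholders, not the authors' architectures or data; StepLR's gamma=0.5 is the decay factor and is unrelated to the paper's threshold γ = 0.1.

```python
# Hypothetical training-setup sketch with the reported hyperparameters.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data, only to make the sketch self-contained.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))
data = TensorDataset(torch.randn(640, 1, 28, 28), torch.randint(0, 10, (640,)))
loader = DataLoader(data, batch_size=64, shuffle=True)        # batch size 64

optimizer = optim.Adam(model.parameters(), lr=1e-3)           # Adam, initial lr 1e-3
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                                      # 100 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                                          # halve lr every 20 epochs
```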