Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Authors: Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton

NeurIPS 2020

Reproducibility Assessment (variable, result, and supporting LLM response)
Research Type: Experimental
"On image classification tasks, with MLP, LeNet, ResNet-20, and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles." From Section 5 (Experiments): "Throughout the experiments, we use both metrics that depend on the predictive uncertainty, negative log-likelihood (NLL) and expected calibration error (ECE) [55], and metrics that do not, e.g., the classification accuracy."
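
Both uncertainty-dependent metrics quoted above are standard. As a reference point, here is a minimal NumPy sketch of ECE; the equal-width binning scheme and the `n_bins=15` default are common-practice assumptions, since the excerpt defers the definition to [55]:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average the absolute gap
    between per-bin accuracy and per-bin confidence, weighted by bin size."""
    conf = probs.max(axis=1)          # predicted confidence per example
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```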
Researcher Affiliation: Industry
"Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton. Google Research. {florianwenzel, jsnoek, trandustin, rjenatton}@google.com"
Pseudocode: Yes
"Algorithm 1: hyper_deep_ens(K, κ)"
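
Algorithm 1 combines random hyperparameter search with a greedy ensemble-selection step (the paper follows the selection procedure of Caruana et al.) plus a stratification pass over random seeds. A minimal sketch of the search-plus-selection core, assuming hypothetical helpers `sample_hparams` (draws one configuration) and `train_and_validate` (returns a trained model and its validation-set class probabilities); the stratification step is omitted here:

```python
import numpy as np

def nll(probs, labels):
    """Validation negative log-likelihood (lower is better)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def hyper_deep_ens(K, kappa, sample_hparams, train_and_validate, val_labels):
    """Sketch: random search over kappa configs, then greedily pick K
    members (with replacement) minimizing the NLL of the averaged
    predictive distribution on the validation set."""
    # Stage 1: random search; each pool entry is (model, val_probs).
    pool = [train_and_validate(sample_hparams()) for _ in range(kappa)]

    # Stage 2: greedy forward selection.
    ensemble, avg_probs = [], 0.0
    for t in range(1, K + 1):
        best = min(pool, key=lambda m: nll((avg_probs * (t - 1) + m[1]) / t,
                                           val_labels))
        ensemble.append(best[0])
        avg_probs = (avg_probs * (t - 1) + best[1]) / t
    return ensemble
```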
Open Source Code: Yes
"The code for generic hyper-batch ensemble layers can be found in https://github.com/google/edward2 and the code to reproduce the experiments of Section 5.2 is part of https://github.com/google/uncertainty-baselines."
Open Datasets: Yes
From Section 5 (Experiments): "...MLP and LeNet [44], over CIFAR-100 [40] and Fashion MNIST [73]. ResNet-20 [31] and Wide ResNet 28-10 models [74] as they are simple architectures with competitive performance on image classification tasks. We consider six different L2 regularization hyperparameters... We show results on CIFAR-10, CIFAR-100 and corruptions on CIFAR-10 [33, 64]."
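
All datasets in the quoted excerpt are publicly available. A minimal loading sketch using TensorFlow Datasets (an assumption: the authors' repositories are TensorFlow-based, but the excerpt does not show their input pipeline; the corruption config name follows TFDS's `{corruption}_{severity}` pattern):

```python
import tensorflow_datasets as tfds

# Standard splits for the clean datasets used in Sections 5.1 and 5.2.
cifar10_train, cifar10_test = tfds.load("cifar10", split=["train", "test"])
cifar100_train, cifar100_test = tfds.load("cifar100", split=["train", "test"])
fmnist_train, fmnist_test = tfds.load("fashion_mnist", split=["train", "test"])

# CIFAR-10 corruptions [33]: one config per corruption type and severity;
# only a test split exists. Example: Gaussian noise at severity 3.
cifar10_c = tfds.load("cifar10_corrupted/gaussian_noise_3", split="test")
```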
Dataset Splits: Yes
"Appendix C.1 gives all the details about the training, tuning, and dataset definitions. The validation steps (6) and (9) seek to optimize the bounds of the ranges."
Hardware Specification: No
No specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided. The paper mentions "Training time and memory cost" but does not specify the hardware used for these measurements.
Software Dependencies: No
No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are provided. The paper points to the code repositories https://github.com/google/edward2 and https://github.com/google/uncertainty-baselines, but does not list the specific library versions required for reproducibility.
Experiment Setup: Yes
"For both models, we add a dropout layer [66] before their last layer. For each pair of dataset/model type, we consider two tuning settings involving the dropout rate and different L2 regularizers defined with varied granularity, e.g., layerwise. Appendix C.1 gives all the details about the training, tuning and dataset definitions. In our experiments, we take Ω to be L2 regularizers applied to the parameters W_k(λ_k) and b_k(λ_k) of each ensemble member. In practice, we use one sample of Λ_K for each data point in the batch: for MLP/LeNet (Section 5.1), we use 256, while for ResNet-20/Wide ResNet 28-10 (Section 5.2), we use 512 (64 for each of 8 workers)."
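
The quoted setup samples one hyperparameter configuration per data point and applies member-wise L2 penalties. A minimal sketch, assuming log-uniform sampling within tunable bounds (the "bounds of the ranges" tuned on validation data) and illustrative range endpoints; in the authors' hyper-batch ensemble the weights W_k(λ_k) and b_k(λ_k) are themselves functions of λ_k, which this sketch omits (see the edward2 layers for the real implementation):

```python
import tensorflow as tf

def sample_lambda(batch_size, K, low, high):
    """One hyperparameter draw per data point and ensemble member,
    log-uniform within the current bounds [low, high]."""
    u = tf.random.uniform([batch_size, K, low.shape[0]])
    log_low, log_high = tf.math.log(low), tf.math.log(high)
    return tf.exp(log_low + u * (log_high - log_low))

def l2_penalty(member_weights, lam):
    """Omega from the quote: member-wise L2 penalties, one strength per
    member (lam averaged over the batch here, purely for illustration)."""
    return tf.add_n([tf.reduce_mean(lam[:, k]) * tf.reduce_sum(w ** 2)
                     for k, w in enumerate(member_weights)])

# Batch of 512 as in Section 5.2 (64 per each of 8 workers), K = 3 members;
# the range endpoints below are illustrative assumptions.
lam = sample_lambda(batch_size=512, K=3,
                    low=tf.constant([1e-5]), high=tf.constant([1e-1]))
```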