Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Authors: Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton

NeurIPS 2020

Reproducibility Assessment (variable, result, and supporting LLM response)
Research Type: Experimental
"On image classification tasks, with MLP, LeNet, ResNet-20, and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles." From Section 5 (Experiments): "Throughout the experiments, we use both metrics that depend on the predictive uncertainty, negative log-likelihood (NLL) and expected calibration error (ECE) [55], and metrics that do not, e.g., the classification accuracy."
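
Both uncertainty-dependent metrics quoted above are standard. As a reference point, here is a minimal NumPy sketch of ECE; the equal-width binning scheme and the `n_bins=15` default are common-practice assumptions, since the excerpt defers the definition to [55]:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average the absolute gap
    between per-bin accuracy and per-bin confidence, weighted by bin size."""
    conf = probs.max(axis=1)          # predicted confidence per example
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```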
Researcher Affiliation: Industry
"Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton. Google Research. {florianwenzel, jsnoek, trandustin, rjenatton}@google.com"
Pseudocode: Yes
"Algorithm 1: hyper_deep_ens(K, κ)"
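
Algorithm 1 combines random hyperparameter search with a greedy ensemble-selection step (the paper follows the selection procedure of Caruana et al.) plus a stratification pass over random seeds. A minimal sketch of the search-plus-selection core, assuming hypothetical helpers `sample_hparams` (draws one configuration) and `train_and_validate` (returns a trained model and its validation-set class probabilities); the stratification step is omitted here:

```python
import numpy as np

def nll(probs, labels):
    """Validation negative log-likelihood (lower is better)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def hyper_deep_ens(K, kappa, sample_hparams, train_and_validate, val_labels):
    """Sketch: random search over kappa configs, then greedily pick K
    members (with replacement) minimizing the NLL of the averaged
    predictive distribution on the validation set."""
    # Stage 1: random search; each pool entry is (model, val_probs).
    pool = [train_and_validate(sample_hparams()) for _ in range(kappa)]

    # Stage 2: greedy forward selection.
    ensemble, avg_probs = [], 0.0
    for t in range(1, K + 1):
        best = min(pool, key=lambda m: nll((avg_probs * (t - 1) + m[1]) / t,
                                           val_labels))
        ensemble.append(best[0])
        avg_probs = (avg_probs * (t - 1) + best[1]) / t
    return ensemble
```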
Open Source Code: Yes
"The code for generic hyper-batch ensemble layers can be found in https://github.com/google/edward2 and the code to reproduce the experiments of Section 5.2 is part of https://github.com/google/uncertainty-baselines."
Open Datasets: Yes
From Section 5 (Experiments): "...MLP and LeNet [44], over CIFAR-100 [40] and Fashion MNIST [73]. ResNet-20 [31] and Wide ResNet 28-10 models [74] as they are simple architectures with competitive performance on image classification tasks. We consider six different L2 regularization hyperparameters... We show results on CIFAR-10, CIFAR-100 and corruptions on CIFAR-10 [33, 64]."
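
All datasets in the quoted excerpt are publicly available. A minimal loading sketch using TensorFlow Datasets (an assumption: the authors' repositories are TensorFlow-based, but the excerpt does not show their input pipeline; the corruption config name follows TFDS's `{corruption}_{severity}` pattern):

```python
import tensorflow_datasets as tfds

# Standard splits for the clean datasets used in Sections 5.1 and 5.2.
cifar10_train, cifar10_test = tfds.load("cifar10", split=["train", "test"])
cifar100_train, cifar100_test = tfds.load("cifar100", split=["train", "test"])
fmnist_train, fmnist_test = tfds.load("fashion_mnist", split=["train", "test"])

# CIFAR-10 corruptions [33]: one config per corruption type and severity;
# only a test split exists. Example: Gaussian noise at severity 3.
cifar10_c = tfds.load("cifar10_corrupted/gaussian_noise_3", split="test")
```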
Dataset Splits: Yes
"Appendix C.1 gives all the details about the training, tuning, and dataset definitions. The validation steps (6) and (9) seek to optimize the bounds of the ranges."
Hardware Specification: No
No specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided. The paper mentions "Training time and memory cost" but does not specify the hardware used for these measurements.
Software Dependencies: No
No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are provided. The paper points to the code repositories https://github.com/google/edward2 and https://github.com/google/uncertainty-baselines, but does not list the specific library versions required for reproducibility.
Experiment Setup: Yes
"For both models, we add a dropout layer [66] before their last layer. For each pair of dataset/model type, we consider two tuning settings involving the dropout rate and different L2 regularizers defined with varied granularity, e.g., layerwise. Appendix C.1 gives all the details about the training, tuning and dataset definitions. In our experiments, we take Ω to be L2 regularizers applied to the parameters W_k(λ_k) and b_k(λ_k) of each ensemble member. In practice, we use one sample of Λ_K for each data point in the batch: for MLP/LeNet (Section 5.1), we use 256, while for ResNet-20/Wide ResNet 28-10 (Section 5.2), we use 512 (64 for each of 8 workers)."
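
The quoted setup samples one hyperparameter configuration per data point and applies member-wise L2 penalties. A minimal sketch, assuming log-uniform sampling within tunable bounds (the "bounds of the ranges" tuned on validation data) and illustrative range endpoints; in the authors' hyper-batch ensemble the weights W_k(λ_k) and b_k(λ_k) are themselves functions of λ_k, which this sketch omits (see the edward2 layers for the real implementation):

```python
import tensorflow as tf

def sample_lambda(batch_size, K, low, high):
    """One hyperparameter draw per data point and ensemble member,
    log-uniform within the current bounds [low, high]."""
    u = tf.random.uniform([batch_size, K, low.shape[0]])
    log_low, log_high = tf.math.log(low), tf.math.log(high)
    return tf.exp(log_low + u * (log_high - log_low))

def l2_penalty(member_weights, lam):
    """Omega from the quote: member-wise L2 penalties, one strength per
    member (lam averaged over the batch here, purely for illustration)."""
    return tf.add_n([tf.reduce_mean(lam[:, k]) * tf.reduce_sum(w ** 2)
                     for k, w in enumerate(member_weights)])

# Batch of 512 as in Section 5.2 (64 per each of 8 workers), K = 3 members;
# the range endpoints below are illustrative assumptions.
lam = sample_lambda(batch_size=512, K=3,
                    low=tf.constant([1e-5]), high=tf.constant([1e-1]))
```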