Hyperparameter Ensembles for Robustness and Uncertainty Quantification
Authors: Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On image classification tasks, with MLP, LeNet, ResNet 20 and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles. ... Throughout the experiments, we use both metrics that depend on the predictive uncertainty, negative log-likelihood (NLL) and expected calibration error (ECE) [55], and metrics that do not, e.g., the classification accuracy. (A sketch of the NLL and ECE computations follows the table.) |
| Researcher Affiliation | Industry | Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton (Google Research), {florianwenzel, jsnoek, trandustin, rjenatton}@google.com |
| Pseudocode | Yes | Algorithm 1: hyper_deep_ens(K, κ) (a sketch of its greedy selection step follows the table) |
| Open Source Code | Yes | The code for generic hyper-batch ensemble layers can be found in https://github.com/google/edward2 and the code to reproduce the experiments of Section 5.2 is part of https://github.com/google/uncertainty-baselines. |
| Open Datasets | Yes | 5 Experiments ... MLP and LeNet [44], over CIFAR-100 [40] and Fashion MNIST [73]. ... ResNet-20 [31] and Wide ResNet 28-10 models [74], as they are simple architectures with competitive performance on image classification tasks. We consider six different L2 regularization hyperparameters. ... We show results on CIFAR-10, CIFAR-100 and corruptions on CIFAR-10 [33, 64]. |
| Dataset Splits | Yes | Appendix C.1 gives all the details about the training, tuning and dataset definitions. The validation steps (6) and (9) seek to optimize the bounds of the ranges. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided. The paper mentions "Training time and memory cost" but does not specify the hardware used for these measurements. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are provided. The paper mentions "https://github.com/google/edward2" and "https://github.com/google/uncertainty-baselines" which are code repositories, but does not list specific library versions required for reproducibility. |
| Experiment Setup | Yes | For both models, we add a dropout layer [66] before their last layer. For each pair of dataset/model type, we consider two tuning settings involving the dropout rate and different L2 regularizers defined with varied granularity, e.g., layerwise. Appendix C.1 gives all the details about the training, tuning and dataset definitions. In our experiments, we take Ω to be L2 regularizers applied to the parameters Wk(λk) and bk(λk) of each ensemble member. In practice, we use one sample of ΛK for each data point in the batch: for MLP/LeNet (Section 5.1), we use 256, while for ResNet-20/Wide ResNet 28-10 (Section 5.2), we use 512 (64 for each of 8 workers). (See the per-batch hyperparameter sampling sketch after the table.) |
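The metrics quoted in the Research Type row, negative log-likelihood (NLL) and expected calibration error (ECE) [55], can be made concrete with a short sketch. This is a minimal NumPy illustration, not the paper's evaluation code; the 15-bin default and the function names are assumptions.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean NLL of predictive probabilities.

    probs:  (N, C) array of class probabilities (rows sum to 1).
    labels: (N,) array of integer class labels.
    """
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average the gap
    |accuracy - confidence| over bins, weighted by bin occupancy."""
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```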
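Algorithm 1, hyper_deep_ens(K, κ), builds a hyper-deep ensemble by training a pool of models under random hyperparameter search (κ controls the search budget relative to the ensemble size K) and then picking K members by the greedy ensemble selection of Caruana et al. Below is a minimal sketch of that greedy step, assuming each candidate is summarized by its validation-set probabilities; the NLL objective and helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def greedy_selection(val_probs, val_labels, K):
    """Greedy ensemble selection (Caruana et al., 2004), with replacement.

    val_probs:  list of (N, C) arrays, one per candidate model's
                validation-set predictive probabilities.
    val_labels: (N,) array of integer validation labels.
    K:          target ensemble size.
    Returns the indices of the chosen members (repeats allowed).
    """
    def nll(p):  # score an averaged prediction by its NLL
        return -np.mean(np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12))

    chosen, running_sum = [], np.zeros_like(val_probs[0])
    for _ in range(K):
        # Add the candidate whose inclusion minimizes the ensemble NLL.
        scores = [nll((running_sum + p) / (len(chosen) + 1)) for p in val_probs]
        best = int(np.argmin(scores))
        chosen.append(best)
        running_sum += val_probs[best]
    return chosen
```

In the paper, this selection is combined with a stratification step over random initializations, so that the final ensemble varies in both weights and hyperparameters.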
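The Experiment Setup row states that one sample of ΛK (the K members' hyperparameters) is drawn for each data point in the batch, and that Ω is a per-member L2 regularizer on Wk(λk) and bk(λk). The sketch below illustrates that sampling and penalty under an assumed log-uniform distribution whose range bounds would be the quantities tuned in validation steps (6) and (9); the shapes, bounds, and names are illustrative, not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lambda_k(lower, upper, batch_size):
    """Draw one hyperparameter sample per data point and per member.

    lower, upper: (K, H) arrays of per-member range bounds
                  (K ensemble members, H hyperparameters each).
    Returns a (batch_size, K, H) array, sampled log-uniformly.
    """
    u = rng.uniform(size=(batch_size,) + lower.shape)
    return np.exp(np.log(lower) + u * (np.log(upper) - np.log(lower)))

def l2_penalty(member_weights, l2_coeffs):
    """Per-member L2 regularizer: Omega = sum_k lambda_k * ||W_k||^2."""
    return sum(lam * np.sum(w ** 2)
               for w, lam in zip(member_weights, l2_coeffs))

# Example: K=3 members, H=2 hyperparameters (dropout rate, L2 coefficient),
# and a batch of 4 data points, as in the hyper-batch ensemble setting.
lower = np.tile([0.1, 1e-4], (3, 1))
upper = np.tile([0.5, 1e-2], (3, 1))
lambdas = sample_lambda_k(lower, upper, batch_size=4)
print(lambdas.shape)  # (4, 3, 2)
```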