NOMU: Neural Optimization-based Model Uncertainty

Authors: Jakob M. Heiss, Jakob Weissteiner, Hanna S. Wutte, Sven Seuken, Josef Teichmann

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate NOMU in various regression tasks and noiseless Bayesian optimization (BO) with costly evaluations. In regression, NOMU performs at least as well as state-of-the-art methods. In BO, NOMU even outperforms all considered benchmarks.
Researcher Affiliation | Academia | ETH Zurich, ETH AI Center, University of Zurich. Correspondence to: Jakob Weissteiner <weissteiner@ifi.uzh.ch>.
Pseudocode | Yes | Algorithm 1 greedily grows an ensemble among a given pre-defined set of models M, until some target size M is met, by selecting with replacement the NN leading to the best improvement of a certain score S on a validation set. Algorithm 2: hyper deep ensembles (Wenzel et al., 2020b). (See the greedy-selection sketch after the table.)
Open Source Code | Yes | Our source code is available on GitHub: https://github.com/marketdesignresearch/NOMU.
Open Datasets | Yes | Finally, we evaluate NOMU on the real-world UCI data sets (Section 4.1.4). We consider the popular task of interpolating the solar irradiance data (Steinhilber et al., 2009). We consider ten different 1D functions whose graphs are shown in Figure 3. Those include the popular Levy and Forrester functions with multiple local optima. (See the test-function sketch after the table.)
Dataset Splits | Yes | With a 70/20/10 train-validation-test split, we equip NOMU with a shallow architecture of 50 hidden nodes and train it for 400 epochs. Validation data are used to calibrate the constant c on the NLL. (See the split-and-calibration sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions software frameworks and general training parameters.
Software Dependencies | No | All NN-based methods are fully connected feed-forward NNs with ReLU activation functions, implemented in TENSORFLOW.KERAS and trained for 2^10 epochs of TENSORFLOW.KERAS' Adam stochastic gradient descent with standard learning rate 0.001 and a full batch size of all training points. We use the Gaussian Process Regressor from SCIKIT-LEARN. The paper mentions software such as TENSORFLOW.KERAS, Adam, and SCIKIT-LEARN, but does not specify their version numbers for reproducibility. (See the training-setup sketch after the table.)
Experiment Setup | Yes | For each of the two NOMU subnetworks, we use a feed-forward NN with three fully-connected hidden layers with 2^10 nodes each, ReLUs, and hyperparameters π_exp = 0.01, π_sqr = 0.1, c_exp = 30. Moreover, we set λ = 10^-8, accounting for zero data noise, ℓ_min = 0.001 and ℓ_max = 2. We train them on the standard regularized mean squared error (MSE) with regularization parameter λ = 10^-8 / n_train, chosen to represent the same data-noise assumptions as NOMU. The MC dropout network is set up with three hidden layers as well, with 2^10, 2^11 and 2^10 nodes (resulting in roughly 4 million parameters). Both training and prediction of this model are performed with constant dropout probability p := p_i = 0.2, as proposed in (Gal & Ghahramani, 2016). We perform 100 stochastic forward passes. For NOMU, we set π_sqr = 1, ℓ_min = 10^-6 and use ℓ = 500 artificial input points for 5D, 10D and 20D. Otherwise we use the exact same hyperparameters as in 1D and 2D regression (see the paragraph Algorithm Setup in Section 4.1). (See the MC-dropout sketch after the table.)
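
The greedy ensemble-growing procedure summarized in the Pseudocode row (Algorithm 1, following the hyper deep ensembles of Wenzel et al., 2020b) can be illustrated as follows. This is a minimal sketch, assuming a list of pre-trained candidate models and a callable validation score S where lower is better; the names `candidate_models`, `score`, and `target_size` are illustrative, not the authors' implementation.

```python
# Sketch of a greedy, with-replacement ensemble-growing loop in the spirit of
# Algorithm 1 (hyper deep ensembles). `score(models)` is assumed to return a
# validation score S for the ensemble `models`, with lower values being better.

def grow_ensemble(candidate_models, score, target_size):
    """Greedily grow an ensemble of size `target_size` from `candidate_models`.

    At each step the candidate whose addition improves the validation score S
    the most is appended; candidates may be chosen more than once
    (selection with replacement).
    """
    ensemble = []
    for _ in range(target_size):
        best_candidate, best_score = None, float("inf")
        for model in candidate_models:
            trial_score = score(ensemble + [model])  # score of the enlarged ensemble
            if trial_score < best_score:
                best_candidate, best_score = model, trial_score
        ensemble.append(best_candidate)
    return ensemble
```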
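
The Levy and Forrester functions named in the Open Datasets row are standard synthetic benchmarks. Below is a sketch of their textbook 1D definitions; the exact domains and any rescaling used in the paper are not specified in the quoted excerpt and may differ.

```python
import numpy as np

def forrester(x):
    """Forrester function, conventionally evaluated on x in [0, 1]."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def levy_1d(x):
    """One-dimensional Levy function, conventionally evaluated on x in [-10, 10]."""
    w = 1.0 + (x - 1.0) / 4.0
    return np.sin(np.pi * w) ** 2 + (w - 1.0) ** 2 * (1.0 + np.sin(2.0 * np.pi * w) ** 2)
```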
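
The Dataset Splits row mentions a 70/20/10 train-validation-test split and a calibration of the constant c on the NLL using validation data. The sketch below shows one way such a split and calibration could look; the grid-search over c and the Gaussian NLL form are assumptions for illustration, not the authors' calibration routine.

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    """Random 70/20/10 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.7 * len(X)), int(0.2 * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def gaussian_nll(y, mean, sigma):
    """Average Gaussian negative log-likelihood of y under N(mean, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mean) ** 2 / (2 * sigma ** 2))

def calibrate_c(y_val, mean_val, sigma_val, grid=np.logspace(-2, 2, 200)):
    """Pick the constant c minimizing the validation NLL of N(mean, (c * sigma)^2)."""
    return min(grid, key=lambda c: gaussian_nll(y_val, mean_val, c * sigma_val))
```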
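
The Software Dependencies row describes fully connected ReLU networks in tensorflow.keras, trained with Adam (learning rate 0.001) for 2^10 epochs on a single full batch, alongside a Gaussian Process Regressor from scikit-learn. The sketch below reproduces that setup on placeholder data; the layer widths, the RBF kernel, and the toy data are illustrative assumptions, not values from the paper.

```python
import numpy as np
import tensorflow as tf
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Placeholder training data (for illustration only).
X = np.random.rand(64, 1).astype("float32")
y = np.sin(6 * X).astype("float32")

# Fully connected ReLU network trained with Adam, lr = 0.001, full batch, 2^10 epochs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
model.fit(X, y, epochs=2 ** 10, batch_size=len(X), verbose=0)  # full-batch training

# Gaussian process baseline from scikit-learn (kernel choice is an assumption).
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X, y)
gp_mean, gp_std = gp.predict(X, return_std=True)
```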
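
The Experiment Setup row describes the MC dropout benchmark: three hidden layers with 2^10, 2^11 and 2^10 nodes, constant dropout probability p = 0.2 kept active at prediction time, and 100 stochastic forward passes. The sketch below follows those quoted numbers; the exact placement of the dropout layers and the untrained toy model are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

p_drop = 0.2  # constant dropout probability from the quoted description

# Three hidden layers with 2^10, 2^11 and 2^10 nodes (roughly 4M parameters);
# dropout placement between layers is an assumption.
mc_model = tf.keras.Sequential([
    tf.keras.layers.Dense(2 ** 10, activation="relu"),
    tf.keras.layers.Dropout(p_drop),
    tf.keras.layers.Dense(2 ** 11, activation="relu"),
    tf.keras.layers.Dropout(p_drop),
    tf.keras.layers.Dense(2 ** 10, activation="relu"),
    tf.keras.layers.Dropout(p_drop),
    tf.keras.layers.Dense(1),
])

def mc_dropout_predict(model, x, n_passes=100):
    """Run `n_passes` forward passes with dropout active (training=True) and
    aggregate them into a predictive mean and standard deviation."""
    samples = np.stack([model(x, training=True).numpy() for _ in range(n_passes)])
    return samples.mean(axis=0), samples.std(axis=0)

x_query = np.random.rand(10, 1).astype("float32")
mu, sigma = mc_dropout_predict(mc_model, x_query, n_passes=100)
```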