Calibration tests beyond classification
Authors: David Widmann, Fredrik Lindsten, Dave Zachariah
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose the first framework that unifies calibration evaluation and tests for general probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. In particular, we reformulate and generalize the kernel calibration error, its estimators, and hypothesis tests using scalar-valued kernels, and evaluate the calibration of real-valued regression problems. |
| Researcher Affiliation | Academia | David Widmann Department of Information Technology Uppsala University, Sweden david.widmann@it.uu.se Fredrik Lindsten Division of Statistics and Machine Learning Linköping University, Sweden fredrik.lindsten@liu.se Dave Zachariah Department of Information Technology Uppsala University, Sweden dave.zachariah@it.uu.se |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured algorithm blocks. |
| Open Source Code | Yes | The source code of the experiments is available at https://github.com/devmotion/Calibration_ICLR2021. |
| Open Datasets | Yes | The Friedman 1 regression problem (Friedman, 1979; 1991; Friedman et al., 1983) is a classic non-linear regression problem with ten-dimensional features and real-valued targets with Gaussian noise. We generate a training data set of 100 inputs distributed uniformly at random in the 10-dimensional unit hypercube and corresponding targets with identically and independently distributed noise following a standard normal distribution. |
| Dataset Splits | Yes | A validation data set of n = 50 i.i.d. pairs of X and Y is used to evaluate the empirical cumulative probability... We generate a training data set of 100 inputs... and a separate test data set. ... Evaluations on the training data set (100 samples) are displayed in green and orange, and on the test data set (50 samples) in blue and purple. |
| Hardware Specification | Yes | Computation time indicates the minimum time in the 500 evaluations on a computer with a 3.6 GHz processor. |
| Software Dependencies | No | The paper mentions using ADAM for optimization and the machine learning framework Flux.jl, but it does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We train a Gaussian predictive model whose mean is modelled by a shallow neural network and a single scalar variance parameter (consistent with the data-generating model) ten times with different initial parameters. ... We use a maximum likelihood approach and train the parameters θ of the model for 5000 iterations by minimizing the mean squared error on the training data set using ADAM... fully connected neural network with two hidden layers with 200 and 50 hidden units and ReLU activation functions. The initial values of the weight matrices of the neural networks are sampled from the uniform Glorot initialization ... and the offset vectors are initialized with zeros. |
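The Friedman 1 data described under "Open Datasets" and "Dataset Splits" can be reproduced directly from its standard definition (Friedman, 1991): ten-dimensional uniform inputs of which only the first five affect the target, plus Gaussian noise. The sketch below is a minimal numpy version, not the authors' code; the function name `friedman1` and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the paper

def friedman1(n, noise_std=1.0, rng=rng):
    """Friedman 1 regression problem: 10-d inputs uniform in the unit
    hypercube; only the first five features enter the target, with
    additive standard normal noise (noise_std=1.0 matches the paper)."""
    X = rng.uniform(size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + noise_std * rng.standard_normal(n))
    return X, y

# Sizes reported in the paper: 100 training samples, 50 validation/test pairs.
X_train, y_train = friedman1(100)
X_val, y_val = friedman1(50)
```

The same generator also appears in scikit-learn as `sklearn.datasets.make_friedman1`, which could be substituted if that dependency is available.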
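The predictive model in the "Experiment Setup" row (a Gaussian model whose mean is a fully connected network with hidden layers of 200 and 50 ReLU units, Glorot-uniform weights, zero offsets, and a single scalar variance parameter) can be sketched as follows. This is a hedged numpy reconstruction of the architecture only, not the authors' Julia/Flux.jl implementation, and it omits the ADAM training loop; the helper names `glorot_uniform` and `predict_mean` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def glorot_uniform(fan_in, fan_out, rng=rng):
    """Uniform Glorot initialization: U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

# Layer sizes from the paper: 10 inputs -> 200 -> 50 -> 1 (predictive mean).
sizes = [10, 200, 50, 1]
weights = [glorot_uniform(i, o) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]  # offset vectors initialized to zero
log_sigma = 0.0  # single scalar variance parameter, shared across inputs

def predict_mean(x):
    """Forward pass through the ReLU network; returns the Gaussian mean."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)  # ReLU activation
    return weights[-1] @ h + biases[-1]  # linear output layer
```

In the paper the parameters (weights, offsets, and the variance) are then fit for 5000 iterations with ADAM by minimizing the mean squared error on the training set; that optimization step is not reproduced here.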