Calibration tests beyond classification
Authors: David Widmann, Fredrik Lindsten, Dave Zachariah
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose the first framework that unifies calibration evaluation and tests for general probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. In particular, we reformulate and generalize the kernel calibration error, its estimators, and hypothesis tests using scalar-valued kernels, and evaluate the calibration of real-valued regression problems. |
| Researcher Affiliation | Academia | David Widmann Department of Information Technology Uppsala University, Sweden david.widmann@it.uu.se Fredrik Lindsten Division of Statistics and Machine Learning Linköping University, Sweden fredrik.lindsten@liu.se Dave Zachariah Department of Information Technology Uppsala University, Sweden dave.zachariah@it.uu.se |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured algorithm blocks. |
| Open Source Code | Yes | The source code of the experiments is available at https://github.com/devmotion/Calibration_ICLR2021. |
| Open Datasets | Yes | The Friedman 1 regression problem (Friedman, 1979; 1991; Friedman et al., 1983) is a classic non-linear regression problem with ten-dimensional features and real-valued targets with Gaussian noise. We generate a training data set of 100 inputs distributed uniformly at random in the 10-dimensional unit hypercube and corresponding targets with identically and independently distributed noise following a standard normal distribution. |
| Dataset Splits | Yes | A validation data set of n = 50 i.i.d. pairs of X and Y is used to evaluate the empirical cumulative probability... We generate a training data set of 100 inputs... and a separate test data set. ... Evaluations on the training data set (100 samples) are displayed in green and orange, and on the test data set (50 samples) in blue and purple. |
| Hardware Specification | Yes | Computation time indicates the minimum time in the 500 evaluations on a computer with a 3.6 GHz processor. |
| Software Dependencies | No | The paper mentions using ADAM for optimization and the machine learning framework Flux.jl, but it does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We train a Gaussian predictive model whose mean is modelled by a shallow neural network and a single scalar variance parameter (consistent with the data-generating model) ten times with different initial parameters. ... We use a maximum likelihood approach and train the parameters θ of the model for 5000 iterations by minimizing the mean squared error on the training data set using ADAM... fully connected neural network with two hidden layers with 200 and 50 hidden units and ReLU activation functions. The initial values of the weight matrices of the neural networks are sampled from the uniform Glorot initialization ... and the offset vectors are initialized with zeros. |
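The Friedman 1 data described under "Open Datasets" and "Dataset Splits" can be reproduced directly from its standard definition (Friedman, 1991): ten-dimensional uniform inputs of which only the first five affect the target, plus Gaussian noise. The sketch below is a minimal numpy version, not the authors' code; the function name `friedman1` and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the paper

def friedman1(n, noise_std=1.0, rng=rng):
    """Friedman 1 regression problem: 10-d inputs uniform in the unit
    hypercube; only the first five features enter the target, with
    additive standard normal noise (noise_std=1.0 matches the paper)."""
    X = rng.uniform(size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + noise_std * rng.standard_normal(n))
    return X, y

# Sizes reported in the paper: 100 training samples, 50 validation/test pairs.
X_train, y_train = friedman1(100)
X_val, y_val = friedman1(50)
```

The same generator also appears in scikit-learn as `sklearn.datasets.make_friedman1`, which could be substituted if that dependency is available.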
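The predictive model in the "Experiment Setup" row (a Gaussian model whose mean is a fully connected network with hidden layers of 200 and 50 ReLU units, Glorot-uniform weights, zero offsets, and a single scalar variance parameter) can be sketched as follows. This is a hedged numpy reconstruction of the architecture only, not the authors' Julia/Flux.jl implementation, and it omits the ADAM training loop; the helper names `glorot_uniform` and `predict_mean` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def glorot_uniform(fan_in, fan_out, rng=rng):
    """Uniform Glorot initialization: U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

# Layer sizes from the paper: 10 inputs -> 200 -> 50 -> 1 (predictive mean).
sizes = [10, 200, 50, 1]
weights = [glorot_uniform(i, o) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]  # offset vectors initialized to zero
log_sigma = 0.0  # single scalar variance parameter, shared across inputs

def predict_mean(x):
    """Forward pass through the ReLU network; returns the Gaussian mean."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)  # ReLU activation
    return weights[-1] @ h + biases[-1]  # linear output layer
```

In the paper the parameters (weights, offsets, and the variance) are then fit for 5000 iterations with ADAM by minimizing the mean squared error on the training set; that optimization step is not reproduced here.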