Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness

Authors: Jeremiah Zhe Liu, Shreyas Padhy, Jie Ren, Zi Lin, Yeming Wen, Ghassen Jerfel, Zachary Nado, Jasper Snoek, Dustin Tran, Balaji Lakshminarayanan

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we benchmark the performance of the SNGP model by applying it to a variety of both toy datasets and real-world tasks across different modalities. Specifically, we first benchmark the behavior of the approximate GP layer versus an exact GP, and illustrate the impact of spectral normalization on toy regression and classification tasks (Section 6.1). We then conduct a thorough benchmark study to compare the performance of SNGP against other state-of-the-art methods on popular benchmarks such as CIFAR-10 and CIFAR-100 (Section 6.2.1). Finally, we illustrate the scalability and generality of the SNGP approach by applying it to a large-scale image recognition task (ImageNet, Section 6.2.2), and highlight its broad usefulness by applying SNGP to uncertainty tasks in two other data modalities, namely conversational intent understanding and genomics sequence identification (Section 6.3).
Researcher Affiliation Collaboration Jeremiah Zhe Liu (EMAIL), Shreyas Padhy (EMAIL), Jie Ren (EMAIL), Zi Lin (EMAIL), Yeming Wen (EMAIL), Ghassen Jerfel (EMAIL), Zachary Nado (EMAIL), Jasper Snoek (EMAIL), Dustin Tran (EMAIL), Balaji Lakshminarayanan (EMAIL); Google Research, Mountain View, CA 94043, USA
Pseudocode Yes Algorithm 1 (SNGP Training). 1: Input: minibatches {D_i}_{i=1}^N with D_i = {y_m, x_m}_{m=1}^M. 2: Initialize: Σ̂ = τI, W_L ~ N(0, 1) i.i.d., b_L ~ U(0, 2π) i.i.d. 3: for train step = 1 to max step do 4: SGD update of {β, {W_l}_{l=1}^{L-1}, {b_l}_{l=1}^{L-1}} (Eq. 14). 5: if final epoch then 6: update precision matrix Σ̂^{-1} (Eq. 13). 7: Compute posterior covariance Σ̂ = inv(Σ̂^{-1}). Algorithm 2 (SNGP Prediction). 1: Input: testing example x. 2: Compute features: Φ = sqrt(2σ²/D_L) · cos(W_L h(x) + b_L). 3: Compute posterior mean: logit(x) = Φᵀβ. 4: Compute posterior variance: var(x) = Φᵀ Σ̂ Φ. 5: Compute predictive posterior distribution: ∫ sigmoid(g) N(g | logit(x), var(x)) dg.
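The prediction procedure in Algorithm 2 above (random Fourier features, then a Gaussian posterior over the logit) can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' code: the toy sizes `D` and `H`, the fixed σ = 1, and the mean-field probit approximation to the sigmoid-Gaussian integral (scaling the logit by 1/√(1 + (π/8)·var)) are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 128, 16  # number of random features and hidden width (toy sizes)

# Random-feature GP layer: fixed W ~ N(0, 1), b ~ U(0, 2*pi), as in Algorithm 1
W = rng.standard_normal((D, H))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def features(h, sigma=1.0):
    # Phi(x) = sqrt(2*sigma^2 / D) * cos(W h(x) + b)   (Algorithm 2, step 2)
    return np.sqrt(2 * sigma**2 / D) * np.cos(W @ h + b)

def predict(h, beta, Sigma):
    phi = features(h)
    logit = phi @ beta        # posterior mean (step 3)
    var = phi @ Sigma @ phi   # posterior variance (step 4)
    # mean-field approximation to E[sigmoid(g)] for g ~ N(logit, var) (step 5)
    return 1.0 / (1.0 + np.exp(-logit / np.sqrt(1.0 + (np.pi / 8) * var)))

beta = rng.standard_normal(D)
Sigma = np.eye(D)  # stands in for inv(precision) computed during training
p = predict(rng.standard_normal(H), beta, Sigma)
```

The predictive integral has no closed form, so the sketch substitutes the standard mean-field (probit) approximation; a Monte Carlo average of `sigmoid(g)` over samples `g ~ N(logit, var)` would work equally well.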
Open Source Code Yes Code is open-sourced at https://github.com/google/uncertainty-baselines.
Open Datasets Yes On a suite of vision and language understanding benchmarks and on modern architectures (Wide ResNet and BERT), SNGP consistently outperforms other single-model approaches in prediction, calibration and out-of-domain detection. For the CIFAR-10 and CIFAR-100 image classification benchmarks, we use a Wide ResNet 28-10 model as the base for all methods (Zagoruyko and Komodakis, 2017). To evaluate the model's OOD detection performance, we consider two tasks: a standard far-OOD task using SVHN as the OOD dataset for a model trained on CIFAR-10/-100, and a difficult near-OOD task using CIFAR-100 as the OOD dataset for a model trained on CIFAR-10, and vice versa. We illustrate the scalability of SNGP by experimenting on the large-scale ImageNet dataset (Russakovsky et al., 2015). To this end, we consider training an intent understanding model using the CLINC out-of-scope (OOS) intent detection benchmark dataset (Larson et al., 2019). Ren et al. (2019) proposed the genomics OOD benchmark dataset, motivated by the real-world problem of bacteria identification based on genomic sequences.
Dataset Splits Yes Following the benchmarking setup first suggested in Ovadia et al. (2019), we evaluate the model's predictive accuracy, negative log-likelihood (NLL), and expected calibration error (ECE) under both clean CIFAR testing data and its corrupted versions termed CIFAR-*-C (Hendrycks and Dietterich, 2018). For this complex, large-scale task with high-dimensional output, we find it difficult to scale some of the other single-model methods (e.g., IsoMax or DUQ), which tend to over-constrain the model expressiveness and lead to lower accuracy than a baseline DNN (a phenomenon we already observe in the CIFAR-100 experiment). Briefly, the OOS dataset contains data for 150 in-domain services with 150 training sentences in each domain, and also 1,500 natural out-of-domain utterances. The genomics OOD benchmark dataset contains 10 bacteria classes as in-distribution for training, and 60 bacteria classes as out-of-distribution for testing. Due to the large size of the test OOD dataset, we randomly select 100,000 OOD samples to pair with the same number of in-distribution samples.
Hardware Specification Yes All models are implemented in TensorFlow and are trained on 8-core Cloud TPU v2 with 8 GiB of high-bandwidth memory (HBM) per TPU core. We use batch size 32 per core.
Software Dependencies No All models are implemented in TensorFlow and are trained on 8-core Cloud TPU v2 with 8 GiB of high-bandwidth memory (HBM) per TPU core. We use batch size 32 per core. SNGP is composed of two components, Spectral Normalization (SN) and a Gaussian Process (GP) layer, both of which are available in the open-source edward2 probabilistic programming library. For the CLINC OOS intent understanding data, we pre-tokenized the sentences using the standard BERT tokenizer with maximum sequence length 32, and created a standard binary input mask for the BERT model that returns 1 for valid tokens and 0 otherwise. Following the original BERT work, we used the Adam optimizer with weight decay rate 0.01 and warmup proportion 0.1.
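As a companion sketch for the SN component named above: spectral normalization rescales a weight matrix so its largest singular value does not exceed a constant c (the paper's experiments use c = 0.95). The edward2 implementation wraps layers and reuses one power-iteration step per training step; the standalone function below, with its `n_iter` loop and hard rescaling rule, is our simplification of that idea.

```python
import numpy as np

def spectral_normalize(W, c=0.95, n_iter=100):
    """Rescale W so its spectral norm is at most c, via power iteration."""
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimate of the largest singular value
    # Only shrink when the norm exceeds c; never inflate small matrices.
    return W * min(1.0, c / sigma)

W = np.random.default_rng(1).standard_normal((32, 32))
W_sn = spectral_normalize(W)
```

The `min(1, c/sigma)` rule mirrors the soft constraint used in distance-preserving residual networks: weights whose spectral norm is already below c pass through unchanged.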
Experiment Setup Yes For CIFAR-10 and CIFAR-100, we followed the original Wide ResNet work to apply the standard data augmentation (horizontal flips and random cropping with 4x4 padding) and used the same hyperparameter and training setup (Zagoruyko and Komodakis, 2017). The only exceptions are the learning rate and training epochs, where we find that a smaller learning rate (0.04 for CIFAR-10 and 0.08 for CIFAR-100, vs. 0.1 for the original WRN model) and more epochs (250 for SNGP vs. 200 for the original WRN model) lead to better performance. For the CLINC OOS intent understanding data, we pre-tokenized the sentences using the standard BERT tokenizer with maximum sequence length 32, and created a standard binary input mask for the BERT model that returns 1 for valid tokens and 0 otherwise. Following the original BERT work, we used the Adam optimizer with weight decay rate 0.01 and warmup proportion 0.1. We initialize the model from the official BERT-Base checkpoint. For this fine-tuning task, we find that a smaller step size (5e-5 for SNGP vs. 1e-4 for the original BERT model) and fewer epochs (40 for SNGP vs. 150 for the original BERT model) lead to better performance. When using spectral normalization, we set the hyperparameter c = 0.95 and apply it to the pooler dense layer of the classification token. For the genomics sequence data, we consider a 1D CNN model following prior work (Ren et al., 2019). Specifically, the model is composed of one convolutional layer with 2,000 filters of length 20, one max-pooling layer, one dense layer of 2,000 units, and a final dense layer with softmax activation for predicting class probabilities. The model is trained with batch size 128 and learning rate 1e-4 using the Adam optimizer. The training budget is set to 1 million steps, and we choose the step at which the validation loss is lowest.
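The genomics architecture described above (one conv layer, max-pooling, one dense layer, softmax output) can be mocked up as a NumPy forward pass. Layer sizes here are scaled far down from the paper's 2,000 filters and 2,000 units, the pooling/activation details are our assumptions, and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    # x: (seq_len, in_ch); kernels: (k, in_ch, out_ch) -> (seq_len - k + 1, out_ch)
    k = kernels.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.einsum('nki,kio->no', windows, kernels)

def forward(x, conv_w, dense_w, out_w):
    h = np.maximum(conv1d(x, conv_w), 0.0)  # ReLU conv features
    h = h.max(axis=0)                       # max-pool over sequence positions
    h = np.maximum(h @ dense_w, 0.0)        # dense layer
    logits = h @ out_w                      # final dense layer
    p = np.exp(logits - logits.max())
    return p / p.sum()                      # softmax class probabilities

# Toy sizes: a length-50 sequence over a 4-letter alphabet, 8 filters of
# length 20, 16 dense units, 10 in-distribution bacteria classes.
x = rng.standard_normal((50, 4))
probs = forward(x,
                rng.standard_normal((20, 4, 8)) * 0.1,
                rng.standard_normal((8, 16)) * 0.1,
                rng.standard_normal((16, 10)) * 0.1)
```

At the paper's scale the same shapes become (sequence length, 4) inputs, (20, 4, 2000) conv filters, and a 2000-unit dense layer, with the softmax over the 10 in-distribution bacteria classes.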