Exploring the Limits of Out-of-Distribution Detection

Authors: Stanislav Fort, Jie Ren, Balaji Lakshminarayanan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that large-scale pre-trained transformers can significantly improve the state-of-the-art (SOTA) on a range of near-OOD tasks across different data modalities. For instance, on CIFAR-100 vs. CIFAR-10 OOD detection, we improve the AUROC from 85% (current SOTA) to 96% using Vision Transformers pre-trained on ImageNet-21k. On a challenging genomics OOD detection benchmark, we improve the AUROC from 66% to 77% using transformers and unsupervised pre-training.
Researcher Affiliation | Collaboration | Stanislav Fort (Stanford University, sfort1@stanford.edu); Jie Ren (Google Research, Brain Team, jjren@google.com); Balaji Lakshminarayanan (Google Research, Brain Team, balajiln@google.com)
Pseudocode | Yes | Algorithm 1 (Few-shot outlier exposure training)
1: Input: in-distribution train set D^in_train = {(x, y)} with K classes, out-of-distribution train subset D^out_few-shot = {(x, y)} with O classes, oversampling factor Γ, a pre-trained feature extractor f(·): x → z, and a simple classification head h(·): z → p ∈ R^(K+O).
2: Initialize: initialize h(·) at random; generate random batches from D^in_train and D^out_few-shot, oversampling D^out_few-shot by Γ.
3: for train_step = 1 to max_step do
4:   loss = CrossEntropy(h(f(x)), y)
5:   SGD update of h(·) w.r.t. the loss
6: end for
Algorithm 2 (Few-shot outlier inference)
1: Input: in-distribution test set D^in_test = {(x, y)} with K classes, out-of-distribution test subset D^out_test = {(x, y)} with O classes, the pre-trained f(·): x → z mapping inputs to embedding vectors, and the trained classification head h(·): z → p ∈ R^(K+O).
2: Compute score^in_oe(x) for x ∈ D^in_test
3: Compute score^out_oe(x) for x ∈ D^out_test
4: Compute AUROC based on the scores.
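As a concrete illustration of Algorithms 1 and 2, the sketch below trains a (K+O)-way head on frozen pre-trained features and scores test inputs for AUROC. This is not the authors' released code: the pre-extracted feature arrays, the LogisticRegression head (standing in for the SGD-trained head in the pseudocode), and the use of outlier-class probability mass as the OOD score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def train_head(z_in, y_in, z_out_few, y_out_few, gamma=10, seed=0):
    """Algorithm 1: fit a (K+O)-way head h(.) on frozen features z = f(x),
    oversampling the few-shot outlier set by the factor gamma (Γ)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(z_out_few), size=gamma * len(z_out_few), replace=True)
    Z = np.concatenate([z_in, z_out_few[idx]])
    Y = np.concatenate([y_in, y_out_few[idx]])
    # LogisticRegression is a stand-in for the SGD-trained cross-entropy head.
    return LogisticRegression(max_iter=1000).fit(Z, Y)


def oe_score(head, z, num_in_classes):
    """Assumed OOD score: probability mass on the O outlier classes
    (labels assumed to be 0..K-1 in-distribution, K..K+O-1 for outliers)."""
    return head.predict_proba(z)[:, num_in_classes:].sum(axis=1)


def evaluate_auroc(head, z_in_test, z_out_test, num_in_classes):
    """Algorithm 2: AUROC of the OOD scores on in- vs. out-of-distribution test sets."""
    s_in = oe_score(head, z_in_test, num_in_classes)
    s_out = oe_score(head, z_out_test, num_in_classes)
    labels = np.concatenate([np.zeros(len(s_in)), np.ones(len(s_out))])
    return roc_auc_score(labels, np.concatenate([s_in, s_out]))
```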
Open Source Code | No | The paper does not provide concrete access to source code for the methodology it describes. It mentions that the authors will open-source code related to human-performance estimation, but not the main experimental code.
Open Datasets | Yes | For instance, for a model trained on CIFAR-100 (which consists of classes such as mammals, fish, flowers, fruits, household devices, trees, vehicles, insects, etc.), a far-OOD task would be detecting digits from the street-view house numbers (SVHN) dataset as outliers. For the same model, detecting images from the CIFAR-10 dataset (which consists of the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) would be considered a near-OOD task, which is more difficult as the classes are semantically similar.
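For reference, the in-distribution, near-OOD, and far-OOD test sets named above can be assembled with standard dataset utilities; the torchvision calls below are an illustrative assumption, since the paper does not tie itself to a particular data-loading library.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# In-distribution: CIFAR-100 test split.
cifar100_test = datasets.CIFAR100(root="./data", train=False, download=True, transform=to_tensor)

# Near-OOD: CIFAR-10 (semantically similar classes).
cifar10_test = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)

# Far-OOD: SVHN (street-view house numbers, semantically distant).
svhn_test = datasets.SVHN(root="./data", split="test", download=True, transform=to_tensor)
```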
Dataset Splits | Yes | Let D^out denote an out-of-distribution dataset of (x^out, y^out) pairs where y^out ∈ Y^out := {K+1, ..., K+O} and Y^out ∩ Y^in = ∅. Depending on how different D^out is from D^in, we categorize the OOD detection tasks into near-OOD and far-OOD. We first study the scenario where the model is fine-tuned only on the training set D^in_train without any access to OOD data. The test set contains D^in_test and D^out_test for evaluating OOD performance using AUROC. Next, we explore the scenario where a small number of OOD examples are available for training, i.e. the few-shot outlier exposure setting. In this setting, the training set contains D^in_train and D^out_few-shot, where |D^out_few-shot| is often smaller than 100 per OOD class.
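The few-shot outlier set D^out_few-shot described here is simply a per-class subsample of the OOD training data. A minimal sketch of that split follows; the helper name and the default of 100 shots per class are illustrative assumptions, not code from the paper.

```python
import numpy as np


def make_few_shot_subset(x_out, y_out, shots=100, seed=0):
    """Keep at most `shots` examples per OOD class to form D^out_few-shot."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y_out):
        idx = np.where(y_out == c)[0]
        keep.extend(rng.choice(idx, size=min(shots, len(idx)), replace=False))
    keep = np.array(keep)
    return x_out[keep], y_out[keep]
```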
Hardware Specification | Yes | We fine-tune the full ViT architecture on a downstream task that is either the CIFAR-10 or CIFAR-100 classification problem (using a TPU in Google Colab). ... The model is pre-trained for 300,000 steps using a learning rate of 0.001 and the Adam optimizer [Kingma and Ba, 2014] on TPU, and the accuracy for predicting the masked token is 48.35%.
Software Dependencies | No | The paper mentions software such as scikit-learn and the Adam optimizer, but does not provide version numbers. For example, it says 'We used a single layer MLP from scikit-learn [Pedregosa et al., 2011], batch size 200, L2 penalty of 1, learning rate 0.001 with Adam, maximum of 1,000 iterations.' without specifying a scikit-learn version.
Experiment Setup | Yes | We used a single layer MLP from scikit-learn [Pedregosa et al., 2011], batch size 200, L2 penalty of 1, learning rate 0.001 with Adam, maximum of 1,000 iterations.
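The quoted setup maps onto scikit-learn's MLPClassifier; the configuration below is a sketch under that assumption. The hyperparameters follow the quote, while the hidden-layer width is not stated in the paper, so the value used here is a placeholder.

```python
from sklearn.neural_network import MLPClassifier

head = MLPClassifier(
    hidden_layer_sizes=(512,),   # "single layer MLP"; width is an assumption
    batch_size=200,              # batch size 200
    alpha=1.0,                   # L2 penalty of 1
    learning_rate_init=0.001,    # learning rate 0.001
    solver="adam",               # Adam optimizer
    max_iter=1000,               # maximum of 1,000 iterations
)
# head.fit(train_features, train_labels)  # features from the frozen pre-trained model
```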