Exploring the Limits of Out-of-Distribution Detection

Authors: Stanislav Fort, Jie Ren, Balaji Lakshminarayanan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that large-scale pre-trained transformers can significantly improve the state-of-the-art (SOTA) on a range of near-OOD tasks across different data modalities. For instance, on CIFAR-100 vs. CIFAR-10 OOD detection, we improve the AUROC from 85% (current SOTA) to 96% using Vision Transformers pre-trained on ImageNet-21k. On a challenging genomics OOD detection benchmark, we improve the AUROC from 66% to 77% using transformers and unsupervised pre-training.
Researcher Affiliation | Collaboration | Stanislav Fort (Stanford University, sfort1@stanford.edu); Jie Ren (Google Research, Brain Team, jjren@google.com); Balaji Lakshminarayanan (Google Research, Brain Team, balajiln@google.com)
Pseudocode | Yes | Algorithm 1 (Few-shot outlier exposure training)
1: Input: in-distribution train set D^in_train = {(x, y)} with K classes, out-of-distribution train subset D^out_few-shot = {(x, y)} with O classes, oversampling factor Γ, a pre-trained feature extractor f(·): x → z, and a simple classification head h(·): z → p ∈ R^(K+O).
2: Initialize: initialize h(·) at random; generate random batches from D^in_train and D^out_few-shot, oversampling D^out_few-shot by Γ.
3: for train_step = 1 to max_step do
4:   loss = CrossEntropy(h(f(x)), y)
5:   SGD update of h(·) w.r.t. the loss
6: end for
Algorithm 2 (Few-shot outlier inference)
1: Input: in-distribution test set D^in_test = {(x, y)} with K classes, out-of-distribution test subset D^out_test = {(x, y)} with O classes, the pre-trained f(·): x → z mapping inputs to embedding vectors, and the trained classification head h(·): z → p ∈ R^(K+O).
2: Compute score^in_oe(x) for x ∈ D^in_test
3: Compute score^out_oe(x) for x ∈ D^out_test
4: Compute AUROC based on the scores.
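As a concrete illustration of Algorithms 1 and 2, the sketch below trains a (K+O)-way head on frozen pre-trained features and scores test inputs for AUROC. This is not the authors' released code: the pre-extracted feature arrays, the LogisticRegression head (standing in for the SGD-trained head in the pseudocode), and the use of outlier-class probability mass as the OOD score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def train_head(z_in, y_in, z_out_few, y_out_few, gamma=10, seed=0):
    """Algorithm 1: fit a (K+O)-way head h(.) on frozen features z = f(x),
    oversampling the few-shot outlier set by the factor gamma (Γ)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(z_out_few), size=gamma * len(z_out_few), replace=True)
    Z = np.concatenate([z_in, z_out_few[idx]])
    Y = np.concatenate([y_in, y_out_few[idx]])
    # LogisticRegression is a stand-in for the SGD-trained cross-entropy head.
    return LogisticRegression(max_iter=1000).fit(Z, Y)


def oe_score(head, z, num_in_classes):
    """Assumed OOD score: probability mass on the O outlier classes
    (labels assumed to be 0..K-1 in-distribution, K..K+O-1 for outliers)."""
    return head.predict_proba(z)[:, num_in_classes:].sum(axis=1)


def evaluate_auroc(head, z_in_test, z_out_test, num_in_classes):
    """Algorithm 2: AUROC of the OOD scores on in- vs. out-of-distribution test sets."""
    s_in = oe_score(head, z_in_test, num_in_classes)
    s_out = oe_score(head, z_out_test, num_in_classes)
    labels = np.concatenate([np.zeros(len(s_in)), np.ones(len(s_out))])
    return roc_auc_score(labels, np.concatenate([s_in, s_out]))
```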
Open Source Code | No | The paper does not provide concrete access to source code for the methodology it describes. It mentions that the authors will open-source code related to human-performance estimation, but not the main experimental code.
Open Datasets | Yes | For instance, for a model trained on CIFAR-100 (which consists of classes such as mammals, fish, flowers, fruits, household devices, trees, vehicles, insects, etc.), a far-OOD task would be detecting digits from the street-view house numbers (SVHN) dataset as outliers. For the same model, detecting images from the CIFAR-10 dataset (which consists of the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) would be considered a near-OOD task, which is more difficult as the classes are semantically similar.
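For reference, the in-distribution, near-OOD, and far-OOD test sets named above can be assembled with standard dataset utilities; the torchvision calls below are an illustrative assumption, since the paper does not tie itself to a particular data-loading library.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# In-distribution: CIFAR-100 test split.
cifar100_test = datasets.CIFAR100(root="./data", train=False, download=True, transform=to_tensor)

# Near-OOD: CIFAR-10 (semantically similar classes).
cifar10_test = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)

# Far-OOD: SVHN (street-view house numbers, semantically distant).
svhn_test = datasets.SVHN(root="./data", split="test", download=True, transform=to_tensor)
```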
Dataset Splits | Yes | Let D^out denote an out-of-distribution dataset of (x^out, y^out) pairs where y^out ∈ Y^out := {K+1, ..., K+O} and Y^out ∩ Y^in = ∅. Depending on how different D^out is from D^in, we categorize the OOD detection tasks into near-OOD and far-OOD. We first study the scenario where the model is fine-tuned only on the training set D^in_train without any access to OOD data. The test set contains D^in_test and D^out_test for evaluating OOD performance using AUROC. Next, we explore the scenario where a small number of OOD examples are available for training, i.e. the few-shot outlier exposure setting. In this setting, the training set contains D^in_train and D^out_few-shot, where |D^out_few-shot| is often smaller than 100 per OOD class.
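The few-shot outlier set D^out_few-shot described here is simply a per-class subsample of the OOD training data. A minimal sketch of that split follows; the helper name and the default of 100 shots per class are illustrative assumptions, not code from the paper.

```python
import numpy as np


def make_few_shot_subset(x_out, y_out, shots=100, seed=0):
    """Keep at most `shots` examples per OOD class to form D^out_few-shot."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y_out):
        idx = np.where(y_out == c)[0]
        keep.extend(rng.choice(idx, size=min(shots, len(idx)), replace=False))
    keep = np.array(keep)
    return x_out[keep], y_out[keep]
```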
Hardware Specification | Yes | We fine-tune the full ViT architecture on a downstream task that is either the CIFAR-10 or CIFAR-100 classification problem (using a TPU in Google Colab). ... The model is pre-trained for 300,000 steps using a learning rate of 0.001 and the Adam optimizer [Kingma and Ba, 2014] on TPU, and the accuracy for predicting the masked token is 48.35%.
Software Dependencies | No | The paper mentions software such as scikit-learn and the Adam optimizer, but does not provide version numbers. For example, it says 'We used a single layer MLP from scikit-learn [Pedregosa et al., 2011], batch size 200, L2 penalty of 1, learning rate 0.001 with Adam, maximum of 1,000 iterations.' without specifying a scikit-learn version.
Experiment Setup | Yes | We used a single layer MLP from scikit-learn [Pedregosa et al., 2011], batch size 200, L2 penalty of 1, learning rate 0.001 with Adam, maximum of 1,000 iterations.
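The quoted setup maps onto scikit-learn's MLPClassifier; the configuration below is a sketch under that assumption. The hyperparameters follow the quote, while the hidden-layer width is not stated in the paper, so the value used here is a placeholder.

```python
from sklearn.neural_network import MLPClassifier

head = MLPClassifier(
    hidden_layer_sizes=(512,),   # "single layer MLP"; width is an assumption
    batch_size=200,              # batch size 200
    alpha=1.0,                   # L2 penalty of 1
    learning_rate_init=0.001,    # learning rate 0.001
    solver="adam",               # Adam optimizer
    max_iter=1000,               # maximum of 1,000 iterations
)
# head.fit(train_features, train_labels)  # features from the frozen pre-trained model
```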