Learning by Erasing: Conditional Entropy Based Transferable Out-of-Distribution Detection

Authors: Meng Xing, Zhiyong Feng, Yong Su, Changjae Oh

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Excerpt from the paper's Experiments section:

Implementation Details: All three parallel encoder branches consist of multiple convolution and upsampling layers with different kernel sizes (3×3, 5×5, and 7×7). A shared convolutional layer with a kernel size of 1×1 is used to transform the features from the three parallel encoder branches into the uncertainty estimation space. The decoder consists of two convolutional layers with a kernel size of 3×3. We set the batch size and learning rate to 64 and 10⁻⁵, respectively.  is empirically set to 0.8. We trained the network for 250 epochs, taking about 48.29 hours. We conduct all experiments on a single NVIDIA 3080 GPU, following the experimental setup of the baseline methods.

Experimental Setting, Datasets: We train our model on ImageNet32 (Deng et al. 2009) and validate it on different ID datasets, including MNIST (LeCun et al. 1998), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), SVHN (Netzer et al. 2011), CelebA (Liu et al. 2015), and CIFAR-10 (Krizhevsky and Hinton 2009). All inputs are resized to 32×32 to fit the input size of the UEN. Grayscale images are transformed into RGB images by replicating the channel.

Metrics: We use threshold-independent metrics, the area under the receiver operating characteristic curve (AUROC) (Davis and Goadrich 2006) and the area under the precision-recall curve (AUPR), to evaluate our method. OOD data and ID data are treated as positive and negative samples for detection, respectively. Unless noted otherwise, we calculate the false positive rate (FPR) of the detector when the threshold is set at 95% TPR. We randomly select 10k samples from the test set of the target dataset and generate test sample groups according to the group size gs. For a fair comparison, we generate the test set 2 times and the test groups 5 times, then report the averaged result.

OOD Detection: To evaluate the robustness of our method, we use five different ID datasets and test each of them against one (MNIST or Fashion-MNIST) or three (SVHN, CelebA, and CIFAR-10) disjoint OOD datasets. The performance for OOD detection and the comparison with three baselines, the Ty-test (Nalisnick et al. 2019), DOCR-TC-M (Zhang et al. 2020), and RF-GM (Jiang, Sun, and Yu 2022), are shown in Table 1.
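The architecture description above is concrete enough to sketch. Below is a minimal PyTorch sketch of the described uncertainty estimation network (UEN): three parallel encoder branches with 3×3, 5×5, and 7×7 kernels, a shared 1×1 convolution into the uncertainty estimation space, and a two-layer 3×3 decoder. The branch depth, channel width (64), ReLU activations, nearest-neighbor upsampling placement, and sum fusion of branch features are assumptions not stated in the excerpt.

```python
# Hypothetical sketch of the described multi-branch architecture in PyTorch.
# Kernel sizes, the shared 1x1 conv, and the two-layer 3x3 decoder follow the
# paper's excerpt; everything else below is an assumption for illustration.
import torch
import torch.nn as nn

class EncoderBranch(nn.Module):
    def __init__(self, kernel_size: int, in_ch: int = 3, width: int = 64):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size, padding=pad),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),  # placement assumed
            nn.Conv2d(width, width, kernel_size, padding=pad),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class UEN(nn.Module):
    """Uncertainty estimation network: 3 parallel branches + shared 1x1 conv."""
    def __init__(self, in_ch: int = 3, width: int = 64):
        super().__init__()
        self.branches = nn.ModuleList(
            [EncoderBranch(k, in_ch, width) for k in (3, 5, 7)]
        )
        # Shared 1x1 conv maps each branch into the uncertainty estimation space.
        self.shared = nn.Conv2d(width, width, kernel_size=1)
        # Decoder: two 3x3 convolutional layers, as stated in the excerpt.
        self.decoder = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, in_ch, 3, padding=1),
        )

    def forward(self, x):
        feats = [self.shared(b(x)) for b in self.branches]
        fused = torch.stack(feats, dim=0).sum(dim=0)  # fusion rule assumed
        return self.decoder(fused)

x = torch.randn(2, 3, 32, 32)  # inputs are resized to 32x32
print(UEN()(x).shape)  # torch.Size([2, 3, 64, 64]) with the assumed 2x upsample
```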
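The evaluation protocol (AUROC, AUPR, and FPR at 95% TPR, with OOD as the positive class) can likewise be sketched. The snippet below uses scikit-learn; `scores_id` and `scores_ood` are hypothetical arrays of per-group detection scores, higher meaning more OOD, and the averaging over the 2 test-set and 5 test-group draws described above is left out for brevity.

```python
# Minimal sketch of the stated metrics, assuming higher score = more OOD.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def ood_metrics(scores_id: np.ndarray, scores_ood: np.ndarray):
    # OOD is the positive class (label 1), ID is the negative class (label 0).
    labels = np.concatenate([np.zeros_like(scores_id), np.ones_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    # FPR at the threshold where TPR first reaches 95%.
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]
    return auroc, aupr, fpr95

# Usage with synthetic, hypothetical scores:
auroc, aupr, fpr95 = ood_metrics(np.random.randn(1000), np.random.randn(1000) + 1)
print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}  FPR@95TPR={fpr95:.3f}")
```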
Researcher Affiliation | Academia | Meng Xing (1,3), Zhiyong Feng (1), Yong Su (2), Changjae Oh (3). Affiliations: (1) College of Intelligence and Computing, Tianjin University; (2) Tianjin Normal University; (3) Centre for Intelligent Sensing, Queen Mary University of London.
Pseudocode | Yes | Algorithm 1: OOD Detection Algorithm
Open Source Code | No | The project codes will be open-sourced on our project website.
Open Datasets | Yes | We train our model on ImageNet32 (Deng et al. 2009) and validate our model on different ID datasets, including MNIST (LeCun et al. 1998), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), SVHN (Netzer et al. 2011), CelebA (Liu et al. 2015), and CIFAR-10 (Krizhevsky and Hinton 2009).
Dataset Splits | No | The paper states 'validate our model on different ID datasets', which refers to evaluating on entirely new datasets rather than to a train/validation/test split within a single dataset. It mentions 'We randomly select 10k samples from the test set of the target dataset' but gives no explicit percentages or counts for training and validation splits.
Hardware Specification | Yes | We conduct all experiments on a single NVIDIA 3080 GPU, following the experimental setup of the baseline methods.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python, PyTorch, or TensorFlow versions) required to reproduce the experiments.
Experiment Setup | Yes | We set the batch size and learning rate to 64 and 10⁻⁵, respectively.  is empirically set to 0.8. We trained the network for 250 epochs, taking about 48.29 hours.
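For concreteness, here is a hedged sketch of the stated training configuration: batch size 64, learning rate 10⁻⁵, and 250 epochs. The optimizer choice (Adam), the reconstruction-style MSE objective, and the placeholder model and dataset are assumptions for illustration; the excerpt does not name them.

```python
# Sketch of the stated training setup, with assumed optimizer and objective.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for ImageNet32 images resized to 32x32.
train_set = TensorDataset(torch.randn(256, 3, 32, 32))
loader = DataLoader(train_set, batch_size=64, shuffle=True)  # batch size 64 as stated

model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder for the UEN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr = 10^-5 as stated
loss_fn = nn.MSELoss()  # reconstruction-style objective assumed

for epoch in range(250):  # 250 epochs as stated (~48.29 h on a single 3080 per the paper)
    for (images,) in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), images)
        loss.backward()
        optimizer.step()
```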