Further Analysis of Outlier Detection with Deep Generative Models
Authors: Ziyu Wang, Bin Dai, David Wipf, Jun Zhu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a possible explanation for this phenomenon, starting from the observation that a model's typical set and high-density region may not coincide. From this vantage point we propose a novel outlier test, the empirical success of which suggests that the failure of existing likelihood-based outlier tests does not necessarily imply that the corresponding generative model is uncalibrated. We also conduct additional experiments to help disentangle the impact of low-level texture versus high-level semantics in differentiating outliers. In this section we evaluate the proposed test, with the goal of better understanding the previous findings in [3]. We consider three implementations of our white noise test, which use different sequences to compute the test statistics (1). (A generic sketch of such a white-noise check appears below this table.) |
| Researcher Affiliation | Collaboration | Ziyu Wang (1,2), Bin Dai (3), David Wipf (4), and Jun Zhu (1,2). 1: Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University, Beijing, China; 2: Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University; 3: Samsung Research China, Beijing, China; 4: AWS AI Lab, Shanghai, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the experiments is available at https://github.com/thu-ml/ood-dgm. |
| Open Datasets | Yes | We use CIFAR-10, CelebA, and Tiny ImageNet images as inliers, and CIFAR-10, CelebA and SVHN images as outliers. Note that both CIFAR datasets have been created from the 80 Million Tiny Images dataset [21]. |
| Dataset Splits | No | The paper mentions using 'inlier test data' and 'outlier datasets' but does not specify explicit train/validation/test splits with percentages or sample counts. It refers to 'inlier test data' as the evaluation set, without detailing how it was partitioned from a larger training set or if a separate validation set was used for model tuning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. It only vaguely mentions 'within the limit of computational resources we have' without further elaboration. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions) needed to replicate the experiment. |
| Experiment Setup | No | The paper refers to using 'the setups from the paper' for pretrained models or training 'under the same setup as in the original papers' for other DGMs and VAEs. While it mentions varying 'nz' for VAEs (e.g., 'nz = 64' or 'nz = 512' in Table 1), it does not provide comprehensive details on concrete hyperparameters such as learning rates, batch sizes, optimizers, or training epochs directly within the main text. |
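The quoted abstract above refers to a "white noise test" computed from different sequences via the paper's test statistic (1). The paper's exact statistic is not reproduced here; the following is a minimal sketch of a generic portmanteau (Ljung-Box style) white-noise check on a 1-D sequence, illustrating the general idea that an inlier's latent or residual sequence should show no autocorrelation. The function name `ljung_box_statistic` and the `max_lag` parameter are illustrative choices, not names from the paper or its code release.

```python
import numpy as np
from scipy import stats


def ljung_box_statistic(seq, max_lag=20):
    """Portmanteau (Ljung-Box) white-noise statistic for a 1-D sequence.

    Under the null hypothesis that `seq` is white noise, the statistic
    is approximately chi-squared with `max_lag` degrees of freedom.
    This is a generic stand-in for the paper's test statistic (1).
    """
    x = np.asarray(seq, dtype=np.float64)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    q = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.dot(x[:-k], x[k:]) / denom  # lag-k sample autocorrelation
        q += rho_k ** 2 / (n - k)
    q *= n * (n + 2)
    p_value = stats.chi2.sf(q, df=max_lag)
    return q, p_value


# Toy usage: an i.i.d. Gaussian sequence behaves like white noise,
# while a random walk is strongly autocorrelated and yields a large Q
# (small p-value), i.e. it would be flagged as deviating from the null.
rng = np.random.default_rng(0)
inlier_like = rng.standard_normal(1024)
outlier_like = np.cumsum(rng.standard_normal(1024))
print(ljung_box_statistic(inlier_like))
print(ljung_box_statistic(outlier_like))
```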