Kernel Stein Tests for Multiple Model Comparison

Authors: Jen Ning Lim, Makoto Yamada, Bernhard Schölkopf, Wittawat Jitkrittum

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments (Section 5) on toy and real (CelebA, Chicago Crime) problems show that the two proposed tests have high true positive rates with well-controlled error rates: the false positive rate (FPR) for RelPSI and the false discovery rate (FDR) for RelMulti.
Researcher Affiliation | Academia | Jen Ning Lim (Max Planck Institute for Intelligent Systems, jlim@tuebingen.mpg.de); Makoto Yamada (Kyoto University and RIKEN AIP, makoto.yamada@riken.jp); Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, bs@tuebingen.mpg.de); Wittawat Jitkrittum (Max Planck Institute for Intelligent Systems, wittawat@tuebingen.mpg.de).
Pseudocode | Yes | We present the asymptotic multivariate characterization of $\widehat{\mathrm{KSD}}^2_u$ in Theorem C.1 (see Algorithm 2 in the appendix for the full algorithm of RelMulti). Since $\Sigma$ is also unknown, we replace $\Sigma$ with a consistent plug-in estimator $\hat{\Sigma}$, given by Bounliphone et al. [4, Theorem 2] for $\widehat{\mathrm{MMD}}^2_u$ and by Theorem C.1 for $\widehat{\mathrm{KSD}}^2_u$. Specifically, we have as the threshold $\hat{t}_\alpha := \hat{\sigma}\,\Phi^{-1}\!\big((1-\alpha)\,\Phi(\mathcal{V}^+/\hat{\sigma})\big)$, where $\hat{\sigma}^2 = \eta^\top \hat{\Sigma}\,\eta$ (see Algorithm 1 in the appendix for the full algorithm of RelPSI).
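As a reading aid, here is a minimal sketch (not the authors' implementation) of the threshold quoted above, assuming the contrast vector $\eta$, the plug-in estimate $\hat{\Sigma}$, and the upper truncation point $\mathcal{V}^+$ from the selection event have already been computed; all names are illustrative.

```python
# Sketch of the RelPSI-style threshold
#   t_hat = sigma_hat * Phi^{-1}((1 - alpha) * Phi(V_plus / sigma_hat)),
# with sigma_hat^2 = eta^T Sigma_hat eta. Variable names are illustrative and the
# selection quantities (eta, V_plus) are assumed to be produced elsewhere.
import numpy as np
from scipy.stats import norm

def relpsi_threshold(eta, Sigma_hat, v_plus, alpha=0.05):
    sigma_hat = np.sqrt(eta @ Sigma_hat @ eta)   # sigma_hat^2 = eta^T Sigma_hat eta
    return sigma_hat * norm.ppf((1.0 - alpha) * norm.cdf(v_plus / sigma_hat))

# Toy usage: eta contrasts two candidate models' discrepancy estimates.
eta = np.array([1.0, -1.0])
Sigma_hat = np.array([[0.4, 0.1],
                      [0.1, 0.5]])
print(relpsi_threshold(eta, Sigma_hat, v_plus=2.0))
```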
Open Source Code | Yes | Code for reproducing the results can be found online at https://github.com/jenninglim/model-comparison-test.
Open Datasets | Yes | We consider the CelebA dataset [24], in which each sample is an image of a celebrity labelled with 40 annotated features. For our final experiment, we examine density estimation models trained on the Chicago Crime dataset considered by Jitkrittum et al. [19].
Dataset Splits | No | For our candidate models, we trained a Mixture of Gaussians (MoG) with expectation maximization with C components, where C ∈ {1, 2, 5}, a Masked Autoencoder for Density Estimation (MADE) [12], and a Masked Autoregressive Flow (MAF) [25]. We split the dataset into disjoint sets such that 7000 samples are used for training and the remainder for testing. This specifies a train/test split, but the question asks for training/test/validation splits; since no validation set is mentioned, the answer is 'No'.
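For concreteness, a minimal sketch of the disjoint split described in that quote (7000 training samples, the rest held out for testing); the shuffling and the seed are assumptions, not details from the paper.

```python
import numpy as np

def disjoint_split(data, n_train=7000, seed=0):
    """Split the rows of `data` into disjoint train/test sets (illustrative only)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    return data[idx[:n_train]], data[idx[n_train:]]

# Example (hypothetical file name): train, test = disjoint_split(np.load("crime.npy"))
```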
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments are provided in the paper.
Software Dependencies | No | The paper mentions models like MADE and MAF, and kernel types, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation or experiments.
Experiment Setup | Yes | Our proposal and the baselines are all non-parametric kernel-based tests. For a fair comparison, all tests use the same Gaussian kernel with its bandwidth chosen by the median heuristic. Figure 1 shows the rejection rates for all tests. As expected, the tests based on KSD have higher power than those based on MMD, due to having access to the density function. Additionally, linear-time estimators perform worse than their complete counterparts. We fix n = 1000, dx = 5, dy = 20. In each trial, n = 2000 samples are used. Each autoregressive layer is a feedforward network with 512 hidden units. Both invertible models are trained with maximum likelihood with a small amount of ℓ2 penalty on the weights.
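The Gaussian kernel with median-heuristic bandwidth mentioned in that setup can be sketched as follows. This is a generic illustration of the standard heuristic (bandwidth set to the median pairwise distance between samples), not code from the authors' repository, and the function names are illustrative.

```python
import numpy as np

def median_heuristic_bandwidth(X):
    """Median pairwise Euclidean distance between the rows of X with shape (n, d)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices_from(sq_dists, k=1)   # exclude zero self-distances
    return np.sqrt(np.median(sq_dists[upper]))

def gaussian_kernel(X, Y, bandwidth):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Example with the dimensions quoted above (n = 1000, dx = 5):
X = np.random.randn(1000, 5)
K = gaussian_kernel(X, X, median_heuristic_bandwidth(X))
```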