Kernel Stein Tests for Multiple Model Comparison

Authors: Jen Ning Lim, Makoto Yamada, Bernhard Schölkopf, Wittawat Jitkrittum

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments (Section 5) on toy and real (CelebA, Chicago Crime) problems show that the two proposed tests have high true positive rates with well-controlled error rates: the false positive rate (FPR) for RelPSI and the false discovery rate (FDR) for RelMulti.
Researcher Affiliation | Academia | Jen Ning Lim (Max Planck Institute for Intelligent Systems, jlim@tuebingen.mpg.de); Makoto Yamada (Kyoto University and RIKEN AIP, makoto.yamada@riken.jp); Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, bs@tuebingen.mpg.de); Wittawat Jitkrittum (Max Planck Institute for Intelligent Systems, wittawat@tuebingen.mpg.de).
Pseudocode | Yes | We present the asymptotic multivariate characterization of $\widehat{\mathrm{KSD}}^2_u$ in Theorem C.1 (see Algorithm 2 in the appendix for the full algorithm of RelMulti). Since $\Sigma$ is also unknown, we replace $\Sigma$ with a consistent plug-in estimator $\hat{\Sigma}$, given by Bounliphone et al. [4, Theorem 2] for $\widehat{\mathrm{MMD}}^2_u$ and by Theorem C.1 for $\widehat{\mathrm{KSD}}^2_u$. Specifically, we have as the threshold $\hat{t}_\alpha := \hat{\sigma}\,\Phi^{-1}\!\big((1-\alpha)\,\Phi(\mathcal{V}^+/\hat{\sigma})\big)$, where $\hat{\sigma}^2 = \eta^\top \hat{\Sigma}\,\eta$ (see Algorithm 1 in the appendix for the full algorithm of RelPSI).
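As a reading aid, here is a minimal sketch (not the authors' implementation) of the threshold quoted above, assuming the contrast vector $\eta$, the plug-in estimate $\hat{\Sigma}$, and the upper truncation point $\mathcal{V}^+$ from the selection event have already been computed; all names are illustrative.

```python
# Sketch of the RelPSI-style threshold
#   t_hat = sigma_hat * Phi^{-1}((1 - alpha) * Phi(V_plus / sigma_hat)),
# with sigma_hat^2 = eta^T Sigma_hat eta. Variable names are illustrative and the
# selection quantities (eta, V_plus) are assumed to be produced elsewhere.
import numpy as np
from scipy.stats import norm

def relpsi_threshold(eta, Sigma_hat, v_plus, alpha=0.05):
    sigma_hat = np.sqrt(eta @ Sigma_hat @ eta)   # sigma_hat^2 = eta^T Sigma_hat eta
    return sigma_hat * norm.ppf((1.0 - alpha) * norm.cdf(v_plus / sigma_hat))

# Toy usage: eta contrasts two candidate models' discrepancy estimates.
eta = np.array([1.0, -1.0])
Sigma_hat = np.array([[0.4, 0.1],
                      [0.1, 0.5]])
print(relpsi_threshold(eta, Sigma_hat, v_plus=2.0))
```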
Open Source Code | Yes | Code for reproducing the results can be found online at https://github.com/jenninglim/model-comparison-test.
Open Datasets | Yes | We consider the CelebA dataset [24], in which each sample is an image of a celebrity labelled with 40 annotated features. For our final experiment, we examine density estimation models trained on the Chicago Crime dataset considered by Jitkrittum et al. [19].
Dataset Splits | No | For our candidate models, we trained a Mixture of Gaussians (MoG) with expectation maximization with C components, where C ∈ {1, 2, 5}, a Masked Autoencoder for Density Estimation (MADE) [12], and a Masked Autoregressive Flow (MAF) [25]. We split the dataset into disjoint sets such that 7000 samples are used for training and the remainder for testing. This specifies a train/test split, but the question asks for training/test/validation splits; since no validation set is mentioned, the answer is 'No'.
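For concreteness, a minimal sketch of the disjoint split described in that quote (7000 training samples, the rest held out for testing); the shuffling and the seed are assumptions, not details from the paper.

```python
import numpy as np

def disjoint_split(data, n_train=7000, seed=0):
    """Split the rows of `data` into disjoint train/test sets (illustrative only)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    return data[idx[:n_train]], data[idx[n_train:]]

# Example (hypothetical file name): train, test = disjoint_split(np.load("crime.npy"))
```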
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments are provided in the paper.
Software Dependencies | No | The paper mentions models like MADE and MAF, and kernel types, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation or experiments.
Experiment Setup | Yes | Our proposal and the baselines are all non-parametric kernel-based tests. For a fair comparison, all tests use the same Gaussian kernel with its bandwidth chosen by the median heuristic. Figure 1 shows the rejection rates for all tests. As expected, the tests based on KSD have higher power than those based on MMD, due to having access to the density function. Additionally, linear-time estimators perform worse than their complete counterparts. We fix n = 1000, dx = 5, dy = 20. In each trial, n = 2000 samples are used. Each autoregressive layer is a feedforward network with 512 hidden units. Both invertible models are trained with maximum likelihood with a small amount of ℓ2 penalty on the weights.
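The Gaussian kernel with median-heuristic bandwidth mentioned in that setup can be sketched as follows. This is a generic illustration of the standard heuristic (bandwidth set to the median pairwise distance between samples), not code from the authors' repository, and the function names are illustrative.

```python
import numpy as np

def median_heuristic_bandwidth(X):
    """Median pairwise Euclidean distance between the rows of X with shape (n, d)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices_from(sq_dists, k=1)   # exclude zero self-distances
    return np.sqrt(np.median(sq_dists[upper]))

def gaussian_kernel(X, Y, bandwidth):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Example with the dimensions quoted above (n = 1000, dx = 5):
X = np.random.randn(1000, 5)
K = gaussian_kernel(X, X, median_heuristic_bandwidth(X))
```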