Confidence Sets and Hypothesis Testing in a Likelihood-Free Inference Setting
Authors: Niccolò Dalmasso, Rafael Izbicki, Ann B. Lee
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of ACORE with both theoretical and empirical results. Our implementation is available on GitHub. In Section 4, we show empirical results connecting the power of the constructed hypothesis tests to the performance of the classifier. We consider two examples where the true likelihood is known. First, we investigate how the power of ACORE and the size of the derived confidence sets depend on the performance of the classifier used in the odds ratio estimation (Section 3.1). We consider three classifiers: multilayer perceptron (MLP), nearest neighbor (NN) and quadratic discriminant analysis (QDA). For different values of B (the sample size for estimating odds ratios), we compute the binary cross entropy (a measure of classifier performance), the power as a function of θ, and the size of the constructed confidence set. Table 2 summarizes results based on 100 repetitions. (A minimal sketch of this classifier-based odds estimation appears after the table.) |
| Researcher Affiliation | Academia | Niccolò Dalmasso¹, Rafael Izbicki², Ann B. Lee¹. ¹Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, USA; ²Department of Statistics, Federal University of São Carlos, São Paulo, Brazil. Correspondence to: Niccolò Dalmasso <ndalmass@stat.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1: Estimate the critical value C for a level-α test of the composite hypotheses H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1. Algorithm 2 [Many Simple Null Hypotheses]: Estimate the critical values Cθ0 for level-α tests of H0,θ0: θ = θ0 vs. H1,θ0: θ ≠ θ0 for all θ0 ∈ Θ simultaneously. |
| Open Source Code | Yes | Our implementation is available on Github. |
| Open Datasets | No | The paper describes using a "stochastic forward simulator" (Fθ) to generate data for its examples (Poisson, GMM, HEP model), and states that it uses a "labeled training sample TB" and a "training sample T′B′". However, it does not refer to or provide access information for any pre-existing publicly available dataset in the conventional sense that would require a link or formal citation. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | Yes | More specifically, an 8-core Intel Xeon 3.33GHz X5680 CPU. |
| Software Dependencies | No | The paper mentions software like PyTorch and scikit-learn in the references, implying their use. However, it does not specify version numbers for these or any other software dependencies needed for replication. |
| Experiment Setup | Yes | We consider three classifiers: multilayer perceptron (MLP), nearest neighbor (NN) and quadratic discriminant analysis (QDA). For different values of B (the sample size for estimating odds ratios), we compute the binary cross entropy. To compute the critical values in Algorithm 2, we use quantile gradient boosted trees and a large enough sample size B′ = 5000. For all 18 settings, the computation of one ACORE confidence set takes between 10 and 30 seconds on a single CPU. We use n = 10. We use a 5-layer deep neural network with B = 100,000. (A sketch of the quantile-regression critical-value step and the resulting confidence set appears after the table.) |
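
To make the classifier-based odds estimation quoted above concrete, here is a minimal sketch. It is not the authors' released implementation: the `simulator` and `reference` functions, the MLP hyperparameters, and the parameter grid are hypothetical stand-ins, and any of the three classifiers from the paper (MLP, NN, QDA) could be swapped in for the MLP used here.

```python
# Minimal sketch of classifier-based odds estimation (in the spirit of ACORE's
# Section 3.1). All concrete choices below are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def simulator(theta, n):
    # Hypothetical stand-in for the stochastic forward simulator F_theta
    # (the paper's examples include Poisson, GMM, and a HEP model).
    return rng.normal(loc=theta, scale=1.0, size=(n, 1))

def reference(n):
    # Hypothetical reference distribution G used to define the odds.
    return rng.uniform(-5.0, 5.0, size=(n, 1))

def train_odds_classifier(theta_grid, B):
    """Build a labeled sample of size B and fit a probabilistic classifier.

    Each row pairs a parameter value theta with a data point x and a label y:
    y = 1 if x was drawn from F_theta, y = 0 if x came from the reference G.
    The classifier's probabilities estimate the odds
    O(x; theta) = P(y = 1 | theta, x) / P(y = 0 | theta, x).
    """
    thetas = rng.choice(theta_grid, size=B)
    y = rng.integers(0, 2, size=B)
    x = np.vstack([simulator(t, 1) if label == 1 else reference(1)
                   for t, label in zip(thetas, y)])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    clf.fit(np.column_stack([thetas, x]), y)
    return clf

def estimated_odds(clf, theta, x_obs):
    """Estimated odds at (theta, x) for each observation in x_obs."""
    feats = np.column_stack([np.full(len(x_obs), theta), x_obs])
    p = np.clip(clf.predict_proba(feats)[:, 1], 1e-6, 1 - 1e-6)
    return p / (1.0 - p)
```

The binary cross entropy of such a classifier on held-out labeled pairs is the same diagnostic the paper reports alongside the power of the tests and the size of the confidence sets, which is how the Research Type row connects classifier performance to test power.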
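
The critical-value step (Algorithm 2) and the inversion of the resulting tests into a confidence set could then look like the sketch below. The quantile gradient boosted trees mirror the choice stated in the Experiment Setup row, but the `test_statistic` placeholder (standing in for the ACORE log-odds statistic), the hyperparameters, and the grid handling are assumptions.

```python
# Sketch of Algorithm 2-style critical values via quantile regression, plus the
# test inversion that yields a confidence set. Assumes `simulator(theta, n)` and
# `test_statistic(theta0, x)` are provided; both are placeholders here.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

def estimate_critical_values(theta_grid, test_statistic, simulator, alpha, B_prime, n):
    """Learn the alpha-quantile of the test statistic as a function of theta.

    For each of B_prime draws, sample theta0 from the grid, simulate n
    observations from F_theta0, and evaluate the statistic; a quantile
    regression of the statistic on theta then returns a cutoff C_theta0
    for every theta0 on the grid at once.
    """
    thetas = rng.choice(theta_grid, size=B_prime)
    stats = np.array([test_statistic(t, simulator(t, n)) for t in thetas])
    qreg = GradientBoostingRegressor(loss="quantile", alpha=alpha)
    qreg.fit(thetas.reshape(-1, 1), stats)
    return qreg

def acore_confidence_set(theta_grid, x_obs, test_statistic, qreg):
    """Invert the family of level-alpha tests: keep every theta0 whose observed
    statistic is at least its estimated cutoff (i.e., theta0 is not rejected)."""
    grid = np.asarray(theta_grid)
    cutoffs = qreg.predict(grid.reshape(-1, 1))
    observed = np.array([test_statistic(t, x_obs) for t in grid])
    return grid[observed >= cutoffs]
```

Because the regression learns the cutoff as a function of θ, a single fit of size B′ serves every null hypothesis on the grid, matching the "for all θ0 ∈ Θ simultaneously" framing of Algorithm 2 in the Pseudocode row.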