On the difficulty of unbiased alpha divergence minimization

Authors: Tomas Geffner, Justin Domke

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirical evaluation. "We now present empirical results that motivate this work. These demonstrate two important phenomena. First, for larger α, optimization scales poorly to high dimensions. Understanding this is the central goal of this paper. Second, this may happen even when the gradient estimator's variance is very small. Instead, we propose that this failure is best explained by the estimator's signal-to-noise ratio (SNR), which is known to be related to optimization convergence (Section 4.3). In Section 5 we empirically confirm that the same phenomena seem to occur in real problems." (A worked SNR sketch follows the table.) |
| Researcher Affiliation | Academia | "College of Information and Computer Science, University of Massachusetts, Amherst, MA, USA." |
| Pseudocode | No | The paper defines gradient estimators using mathematical equations (e.g., eq. 4, eq. 5) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide any links to a code repository. |
| Open Datasets | Yes | "We use two datasets: Iris and Australian, which have dimensionalities 4 and 14, respectively." |
| Dataset Splits | No | The paper mentions using "a subset of 100 samples" for the datasets but does not specify any train/validation/test splits, or even mention a validation set. |
| Hardware Specification | No | The paper does not specify any hardware details such as CPU, GPU models, or memory used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2014)" as an optimizer but does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | "We initialize $\sigma_i = 2$, and optimize $D_\alpha(p \| q_w)$ from eq. 2. We do so by running SGD with the gradient estimator $g_\alpha^{\text{drep}}$ for 1000 steps. For each triplet $(d, \alpha, N)$ we tuned the step-size; we ran simulations for all step-sizes in the set $\{10^i\}_{i=-7}^{7}$ and selected the one that led to the best final performance." "We use a diagonal Gaussian as variational distribution $q_w$, initialized to have mean zero and covariance identity. We optimize $L_\alpha$ by running SGD with unbiased gradient estimates for 1000 steps." (A sketch of this sweep follows the table.) |