On the difficulty of unbiased alpha divergence minimization

Authors: Tomas Geffner, Justin Domke

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirical evaluation. "We now present empirical results that motivate this work. These demonstrate two important phenomena. First, for larger α, optimization scales poorly to high dimensions. Understanding this is the central goal of this paper. Second, this may happen even when the gradient estimator's variance is very small. Instead, we propose that this failure is best explained by the estimator's signal-to-noise ratio (SNR), which is known to be related to optimization convergence (Section 4.3). In Section 5 we empirically confirm that the same phenomena seem to occur in real problems." (A worked SNR sketch follows the table.) |
| Researcher Affiliation | Academia | "College of Information and Computer Science, University of Massachusetts, Amherst, MA, USA." |
| Pseudocode | No | The paper defines gradient estimators using mathematical equations (e.g., eq. 4, eq. 5) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide any links to a code repository. |
| Open Datasets | Yes | "We use two datasets: Iris and Australian, which have dimensionalities 4 and 14, respectively." |
| Dataset Splits | No | The paper mentions using "a subset of 100 samples" for the datasets but does not specify any train/validation/test splits, or even mention a validation set. |
| Hardware Specification | No | The paper does not specify any hardware details such as CPU, GPU models, or memory used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2014)" as an optimizer but does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | "We initialize $\sigma_i = 2$, and optimize $D_\alpha(p \| q_w)$ from eq. 2. We do so by running SGD with the gradient estimator $g_\alpha^{\text{drep}}$ for 1000 steps. For each triplet $(d, \alpha, N)$ we tuned the step-size; we ran simulations for all step-sizes in the set $\{10^i\}_{i=-7}^{7}$ and selected the one that led to the best final performance." "We use a diagonal Gaussian as variational distribution $q_w$, initialized to have mean zero and covariance identity. We optimize $L_\alpha$ by running SGD with unbiased gradient estimates for 1000 steps." (A sketch of this sweep follows the table.) |