On the difficulty of unbiased alpha divergence minimization
Authors: Tomas Geffner, Justin Domke
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation. We now present empirical results that motivate this work. These demonstrate two important phenomena. First, for larger α, optimization scales poorly to high dimensions. Understanding this is the central goal of this paper. Second, this may happen even when the gradient estimator's variance is very small. Instead, we propose that this failure is best explained by the estimator's signal-to-noise ratio (SNR), which is known to be related to optimization convergence (Section 4.3). In Section 5 we empirically confirm that the same phenomena seem to occur in real problems. (An illustrative SNR computation is sketched after the table.) |
| Researcher Affiliation | Academia | College of Information and Computer Science, University of Massachusetts, Amherst, MA, USA. |
| Pseudocode | No | The paper defines gradient estimators using mathematical equations (e.g., eq. 4, eq. 5) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or provide any links to a code repository. |
| Open Datasets | Yes | We use two datasets: Iris and Australian, which have dimensionalities 4 and 14, respectively. |
| Dataset Splits | No | The paper mentions using "a subset of 100 samples" for the datasets but does not specify any train/validation/test splits, or even mention a validation set. |
| Hardware Specification | No | The paper does not specify any hardware details such as CPU, GPU models, or memory used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2014)" as an optimizer but does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | We initialize σ_i = 2, and optimize D_α(p||q_w) from eq. 2. We do so by running SGD with the gradient estimator g^drep_α for 1000 steps. For each triplet (d, α, N) we tuned the step-size; we ran simulations for all step-sizes in the set {10^i : i = −7, ..., 7} and selected the one that led to the best final performance. We use a diagonal Gaussian as variational distribution q_w, initialized to have mean zero and covariance identity. We optimize L_α by running SGD with unbiased gradient estimates for 1000 steps. (A hedged code sketch of this setup follows the table.) |
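
The experiment-setup row above quotes the paper's synthetic protocol; the sketch below is a rough, hedged illustration of it, not the authors' code. It assumes a standard-Gaussian target p, a diagonal-Gaussian q_w initialized with σ_i = 2, and a plain reparameterized Monte Carlo estimator of E_q[(p/q)^α] in place of the paper's g^drep_α; α is fixed to 2 so that minimizing this expectation also minimizes D_α(p||q_w). The 1000 SGD steps and the step-size grid {10^i : i = −7, ..., 7} follow the quote; all names are illustrative.

```python
# Hedged sketch of the quoted setup (not the authors' code): standard-Gaussian
# target, diagonal-Gaussian q_w with sigma_i initialized to 2, plain
# reparameterized estimator of E_q[(p/q)^alpha], SGD for 1000 steps, step-size
# grid {10^i : i = -7, ..., 7}.
import jax
import jax.numpy as jnp

d, alpha, N, steps = 16, 2.0, 64, 1000   # dimension, divergence order (> 1), MC samples, SGD steps

def log_p(z):
    # Fixed target: standard Gaussian in d dimensions (stand-in for the true posterior).
    return -0.5 * jnp.sum(z ** 2) - 0.5 * d * jnp.log(2.0 * jnp.pi)

def log_q(z, w):
    # Diagonal-Gaussian variational density q_w, parameterized by (mu, log_sigma).
    mu, log_sigma = w
    return jnp.sum(-0.5 * ((z - mu) / jnp.exp(log_sigma)) ** 2
                   - log_sigma - 0.5 * jnp.log(2.0 * jnp.pi))

def objective(w, key):
    # Reparameterized Monte Carlo estimate of E_q[(p/q)^alpha]; for alpha > 1,
    # minimizing this quantity also minimizes D_alpha(p || q_w).
    mu, log_sigma = w
    eps = jax.random.normal(key, (N, d))
    z = mu + jnp.exp(log_sigma) * eps            # reparameterization trick
    log_ratio = jax.vmap(log_p)(z) - jax.vmap(log_q, in_axes=(0, None))(z, w)
    return jnp.mean(jnp.exp(alpha * log_ratio))

grad_fn = jax.jit(jax.grad(objective))           # unbiased (reparameterized) gradient of the MC objective

def run_sgd(step_size, key):
    # Plain SGD on (mu, log_sigma) for the quoted number of steps.
    w = (jnp.zeros(d), jnp.log(2.0) * jnp.ones(d))   # sigma_i initialized to 2
    for _ in range(steps):
        key, sub = jax.random.split(key)
        g = grad_fn(w, sub)
        w = tuple(wi - step_size * gi for wi, gi in zip(w, g))
    return w

# Step-size grid from the quote: try {10^i : i = -7, ..., 7} and keep the best
# final objective estimate (diverged runs are mapped to +inf before comparison).
results = []
for i in range(-7, 8):
    w_final = run_sgd(10.0 ** i, jax.random.PRNGKey(i))
    score = objective(w_final, jax.random.PRNGKey(100 + i))
    results.append((float(jnp.nan_to_num(score, nan=jnp.inf)), i))
best_score, best_i = min(results)
print(f"best step-size 1e{best_i}: final objective estimate {best_score:.4g}")
```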
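
The research-type row also quotes the paper's point that the optimization failure is better explained by the estimator's SNR than by its variance. The self-contained sketch below, under the same assumptions as above (plain reparameterized estimator, illustrative names, not the paper's g^drep_α), draws many independent gradient estimates at a fixed w and reports the coordinate-wise ratio |mean| / standard deviation as a simple SNR diagnostic; one could vary d and α to see how it behaves.

```python
# Hedged SNR diagnostic: |E[g]| / std(g) per coordinate, over many independent
# draws of a plain reparameterized gradient estimator of E_q[(p/q)^alpha].
# Illustrative only; not the paper's g^drep_alpha.
import jax
import jax.numpy as jnp

d, alpha, N = 16, 2.0, 64                        # dimension, divergence order, samples per estimate

def log_p(z):                                     # standard-Gaussian target
    return -0.5 * jnp.sum(z ** 2) - 0.5 * d * jnp.log(2.0 * jnp.pi)

def log_q(z, w):                                  # diagonal-Gaussian q_w with parameters (mu, log_sigma)
    mu, log_sigma = w
    return jnp.sum(-0.5 * ((z - mu) / jnp.exp(log_sigma)) ** 2
                   - log_sigma - 0.5 * jnp.log(2.0 * jnp.pi))

def estimate(w, key):                             # one reparameterized MC estimate of E_q[(p/q)^alpha]
    mu, log_sigma = w
    eps = jax.random.normal(key, (N, d))
    z = mu + jnp.exp(log_sigma) * eps
    log_ratio = jax.vmap(log_p)(z) - jax.vmap(log_q, in_axes=(0, None))(z, w)
    return jnp.mean(jnp.exp(alpha * log_ratio))

# Many independent gradient estimates at a fixed point (mu = 0, sigma_i = 2).
w = (jnp.zeros(d), jnp.log(2.0) * jnp.ones(d))
keys = jax.random.split(jax.random.PRNGKey(0), 1000)
# Gradient w.r.t. log_sigma (index 1); the mu-gradient is zero by symmetry here.
grads = jax.vmap(lambda k: jax.grad(estimate)(w, k)[1])(keys)    # shape (1000, d)

snr = jnp.abs(grads.mean(axis=0)) / grads.std(axis=0)
print("per-coordinate SNR of the gradient w.r.t. log_sigma:", snr)
```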