Provable Smoothness Guarantees for Black-Box Variational Inference
Authors: Justin Domke
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 2. Naive optimization can work well, but is sensitive to initialization. Looseness of the objective obtained by naive gradient descent (γ = 1/M), projected gradient descent (γ = 1/(2M)) and proximal gradient descent (γ = 1/M). Optimization starts with m = 0 and C = εI where ε is a scaling factor. Initializing C = 0 is fine for proximal or projected gradient descent, but naive gradient descent requires careful initialization. Results for other datasets in Sec. 8 (supplement). Figure 3. Naive optimization is similar to proximal for large initial C, but worse for small C. Results of optimizing the ELBO with different scaling factors ε on four different datasets. The two right columns show results after enough iterations for proximal optimization to converge to less than 10⁻¹. The left column shows results after 1/10-th as many iterations. Proximal optimization starting with C ≈ 0 always performs well. Projected gradient descent requires more iterations. Naive optimization can work well, but is not guaranteed and requires careful initialization. |
| Researcher Affiliation | Academia | College of Computing and Information Sciences, University of Massachusetts, Amherst, USA. Correspondence to: Justin Domke <domke@cs.umass.edu>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statement or link indicating that the authors released source code for the methodology described in this paper was found. |
| Open Datasets | No | The paper mentions using datasets like "linear regression data (boston, fires)" and "logistic regression (australian, ionosphere)" in Section 6. However, it does not provide concrete access information (e.g., URL, DOI, citation with authors/year) for these datasets, only their names. |
| Dataset Splits | No | No specific dataset split information (percentages, sample counts, or references to predefined splits) for training, validation, or testing was provided in the paper. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were mentioned in the paper. |
| Software Dependencies | No | No specific ancillary software details with version numbers (e.g., Python 3.8, PyTorch 1.9) were provided in the paper. |
| Experiment Setup | Yes | We initialize m to zero and C = εI for a range of scaling constants ε. Figure 2 shows example results on two datasets. For projected or proximal gradient descent, simply initializing C = 0 is fine. For naive gradient descent, initialization is subtle, since too small an ε leads to an enormous entropy gradient (and thus large jumps), while for large ε, all algorithms converge slowly. ... Proximal optimization starting with C ≈ 0 always performs well. Projected gradient descent requires more iterations. Naive optimization can work well, but is not guaranteed and requires careful initialization. ... proximal gradient descent always converges with a step-size of γ = 1/M. ... projected gradient descent always converges with a step-size of 1/(2M). ... The two right columns show results after enough iterations for proximal optimization to converge to less than 10⁻¹. A hedged code sketch of these update rules appears below the table. |
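
Since no code release was found (see the table above), the sketch below illustrates the three update rules referenced in the quoted captions: naive, projected, and proximal gradient descent over a Gaussian variational family N(m, CCᵀ), with step sizes γ = 1/M (naive, proximal) and 1/(2M) (projected) and initialization m = 0, C = εI. It is a minimal reconstruction under assumptions, not the authors' implementation: the toy Gaussian target, the helper names (`smooth_grad`, `prox_logdet`, `project_min_sv`), and the projection floor 1/√M are ours for illustration. The proximal map for the −log|det C| term acts on singular values via s → (s + √(s² + 4γ))/2.

```python
# Minimal sketch, not the authors' code (none was released). Toy Gaussian target
# p(z) = N(0, Sigma), variational family q(z) = N(m, C C^T), so the smooth part of
# the negative ELBO and its smoothness constant M are available in closed form.
# All helper names and the projection floor below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)      # toy target covariance
P = np.linalg.inv(Sigma)             # target precision; -log p is M-smooth with
M = np.linalg.eigvalsh(P).max()      # M = largest eigenvalue of the precision

def smooth_grad(m, C):
    # Gradients of f(m, C) = E_q[-log p(z)] (up to constants): P m and P C.
    return P @ m, P @ C

def neg_elbo(m, C):
    # Negative ELBO up to additive constants: f(m, C) - log|det C|.
    return 0.5 * (m @ P @ m + np.trace(P @ C @ C.T)) - np.linalg.slogdet(C)[1]

def prox_logdet(X, gamma):
    # prox of -gamma*log|det C| at X: each singular value s -> (s + sqrt(s^2 + 4*gamma)) / 2.
    U, s, Vt = np.linalg.svd(X)
    return U @ np.diag((s + np.sqrt(s ** 2 + 4 * gamma)) / 2) @ Vt

def project_min_sv(X, floor):
    # Project X onto matrices whose smallest singular value is at least `floor`.
    U, s, Vt = np.linalg.svd(X)
    return U @ np.diag(np.maximum(s, floor)) @ Vt

def run(method, eps, iters=2000):
    m, C = np.zeros(d), eps * np.eye(d)            # initialization: m = 0, C = eps * I
    for _ in range(iters):
        if not np.isfinite(C).all():               # guard in case an update blew up
            return float("inf")
        gm, gC = smooth_grad(m, C)
        if method == "naive":                      # gamma = 1/M, full gradient incl. entropy term
            gamma = 1.0 / M
            m, C = m - gamma * gm, C - gamma * (gC - np.linalg.inv(C).T)
        elif method == "proximal":                 # gamma = 1/M, prox step handles -log|det C|
            gamma = 1.0 / M
            m, C = m - gamma * gm, prox_logdet(C - gamma * gC, gamma)
        elif method == "projected":                # gamma = 1/(2M); floor 1/sqrt(M) is assumed here
            gamma = 1.0 / (2 * M)
            step = C - gamma * (gC - np.linalg.inv(C).T)
            m, C = m - gamma * gm, project_min_sv(step, 1.0 / np.sqrt(M))
    return neg_elbo(m, C)

for eps in (1e-3, 1.0, 10.0):                      # sweep of scaling constants, as in the quoted setup
    print(eps, {k: round(float(run(k, eps)), 4) for k in ("naive", "proximal", "projected")})
```

Running this reproduces only the qualitative pattern described in the captions (proximal optimization is insensitive to the scaling ε, naive gradient descent is not), not the paper's numbers, which come from real regression datasets.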