Provable Smoothness Guarantees for Black-Box Variational Inference

Authors: Justin Domke

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 2. Naive optimization can work well, but is sensitive to initialization. Looseness of the objective obtained by naive gradient descent (γ = 1/M), projected gradient descent (γ = 1/(2M)), and proximal gradient descent (γ = 1/M). Optimization starts with m = 0 and C equal to the identity times a scaling factor. Initializing C = 0 is fine for proximal or projected gradient descent, but naive gradient descent requires careful initialization. Results for other datasets in Sec. 8 (supplement). Figure 3. Naive optimization is similar to proximal for large initial C, but worse for small C. Results of optimizing the ELBO with different scaling factors on four different datasets. The two right columns show results after enough iterations for proximal optimization to converge to less than 10⁻¹. The left column shows results after 1/10th as many iterations. Proximal optimization starting with C ≈ 0 always performs well. Projected gradient descent requires more iterations. Naive optimization can work well, but is not guaranteed and requires careful initialization. (The update rules and step sizes referenced in these captions are sketched in a note after this table.)
Researcher Affiliation | Academia | College of Computing and Information Sciences, University of Massachusetts, Amherst, USA. Correspondence to: Justin Domke <domke@cs.umass.edu>.
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | No explicit statement or link indicating that the authors released source code for the methodology described in this paper was found.
Open Datasets | No | The paper mentions using datasets like "linear regression data (boston, fires)" and "logistic regression (australian, ionosphere)" in Section 6. However, it does not provide concrete access information (e.g., URL, DOI, citation with authors/year) for these datasets, only their names.
Dataset Splits | No | No specific dataset split information (percentages, sample counts, or references to predefined splits) for training, validation, or testing was provided in the paper.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were mentioned in the paper.
Software Dependencies | No | No specific ancillary software details with version numbers (e.g., Python 3.8, PyTorch 1.9) were provided in the paper.
Experiment Setup | Yes | We initialize m to zero and C to the identity scaled by a range of scaling constants. Figure 2 shows example results on two datasets. For projected or proximal gradient descent, simply initializing C = 0 is fine. For naive gradient descent, initialization is subtle, since too small a scaling constant leads to an enormous entropy gradient (and thus large jumps), while for a large scaling constant, all algorithms converge slowly. ... Proximal optimization starting with C ≈ 0 always performs well. Projected gradient descent requires more iterations. Naive optimization can work well, but is not guaranteed and requires careful initialization. ... proximal gradient descent always converges with a step-size of γ = 1/M. ... projected gradient descent always converges with a step-size of 1/(2M). ... The two right columns show results after enough iterations for proximal optimization to converge to less than 10⁻¹. (A toy code sketch of this setup appears after this table.)
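For orientation, the quantities quoted above (γ, M, m, C) fit a standard scheme: a Gaussian variational family q(z) = N(m, CC^T) whose negative ELBO splits into a smooth energy term with smoothness constant M and an entropy term involving log|det C|. The following is a reconstruction from the captions, not a quotation from the paper, and the notation is assumed.

```latex
% Reconstruction from the captions above; notation is assumed, not quoted.
\begin{aligned}
f(m, C) &= g(m, C) + h(C) + \text{const}, \\
g(m, C) &= \mathbb{E}_{z \sim \mathcal{N}(m,\, C C^{\top})}\!\left[-\log p(z)\right]
          \quad \text{(smooth, with constant } M\text{)}, \\
h(C) &= -\log \lvert \det C \rvert \quad \text{(negative-entropy term)}, \\[4pt]
\text{naive GD:} \quad (m, C) &\leftarrow (m, C) - \gamma \nabla f(m, C), & \gamma &= 1/M, \\
\text{projected GD:} \quad (m, C) &\leftarrow \Pi_{\mathcal{C}}\!\left[(m, C) - \gamma \nabla f(m, C)\right], & \gamma &= 1/(2M), \\
\text{proximal GD:} \quad (m, C) &\leftarrow \operatorname{prox}_{\gamma h}\!\left[(m, C) - \gamma \nabla g(m, C)\right], & \gamma &= 1/M.
\end{aligned}
```

Here Π_𝒞 is projection onto a constraint set on C, and prox_{γh} is the proximal operator of the entropy term; the specific constraint set, and whether the projected step uses the full gradient ∇f or only ∇g, are as specified in the paper.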
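A minimal, self-contained sketch of the three update rules is below. It is not the authors' code: the target is a toy Gaussian (so the expectations are exact and no stochastic gradient estimator is needed), the constraint set σ_min(C) ≥ 1/√M for the projected variant is an assumption made for illustration, and the closed-form prox of -log|det C| is a standard singular-value shift. It only illustrates the initialization and step-size choices quoted in the Experiment Setup row above.

```python
# Rough sketch, not the paper's code: naive, projected, and proximal gradient
# descent on the negative ELBO for q(z) = N(m, C C^T) and a toy Gaussian
# target log p(z) = -0.5 z^T A z + const. With this target all expectations
# are closed form; M is the largest eigenvalue of A (smoothness of the
# energy term). The projection set sigma_min(C) >= 1/sqrt(M) is an assumption.
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)                  # SPD precision matrix of the target
M = np.linalg.eigvalsh(A).max()          # smoothness constant of the energy term


def neg_elbo(m, C):
    """0.5*(m' A m + tr(A C C')) - log|det C|, dropping additive constants."""
    energy = 0.5 * (m @ A @ m + np.trace(A @ C @ C.T))
    return energy - np.linalg.slogdet(C)[1]


def energy_grads(m, C):
    """Gradients of the smooth (energy) term only: A m and A C."""
    return A @ m, A @ C


def prox_neg_logdet(C_bar, gamma):
    """Closed-form prox of gamma * (-log|det C|): shift each singular value up."""
    U, s, Vt = np.linalg.svd(C_bar)
    return U @ np.diag((s + np.sqrt(s**2 + 4 * gamma)) / 2) @ Vt


def project_min_sv(C, floor):
    """Project onto {C : sigma_min(C) >= floor} by clipping singular values."""
    U, s, Vt = np.linalg.svd(C)
    return U @ np.diag(np.maximum(s, floor)) @ Vt


def run(method, scale, iters=2000):
    gamma = 1.0 / (2 * M) if method == "projected" else 1.0 / M
    floor = 1.0 / np.sqrt(M)
    m, C = np.zeros(d), scale * np.eye(d)
    if method == "projected":
        C = project_min_sv(C, floor)     # start from a feasible point
    for _ in range(iters):
        gm, gC = energy_grads(m, C)
        m = m - gamma * gm
        if method == "proximal":
            # gradient step on the smooth term, then prox of the entropy term
            C = prox_neg_logdet(C - gamma * gC, gamma)
        else:
            # full gradient also includes the entropy term's gradient -C^{-T}
            C = C - gamma * (gC - np.linalg.inv(C).T)
            if method == "projected":
                C = project_min_sv(C, floor)
    return neg_elbo(m, C)


for method in ("naive", "projected", "proximal"):
    print(method, {s: round(run(method, s), 3) for s in (1e-3, 1.0, 10.0)})
```

This toy is only qualitative: the paper's experiments use stochastic gradients on real datasets, so the printed numbers do not reproduce Figures 2 or 3; they merely show the three update rules running with the quoted initializations and step sizes.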