Improving Optimization for Models With Continuous Symmetry Breaking
Authors: Robert Bamler, Stephan Mandt
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed Goldstone-GD optimization algorithm on the three example models introduced in Section 3.2. We compare Goldstone-GD to standard GD, to AdaGrad (Duchi et al., 2011), and to Adam (Kingma & Ba, 2014). |
| Researcher Affiliation | Industry | Robert Bamler¹, Stephan Mandt¹ (¹Disney Research, Glendale, CA, USA). Correspondence to: Robert Bamler <robert.bamler@gmail.com>, Stephan Mandt <stephan.mandt@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: Goldstone Gradient Descent (Goldstone-GD) |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology. |
| Open Datasets | Yes | We fit the sparse dynamic Bernoulli factorization model defined in Eqs. 6-9 in Section 3.2 to the Movielens 20M data set (Harper & Konstan, 2016). We fit the model to digitized books from the years 1800 to 2008 in the Google Books corpus (Michel et al., 2011). |
| Dataset Splits | Yes | We split randomly across all bins into 50% training, 20% validation, and 30% test set. |
| Hardware Specification | No | The paper mentions 'embedding dimension to d = 100 due to hardware constraints' but does not provide specific details about the hardware used for the experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions optimizers like 'AdaGrad' and 'Adam' but does not specify version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We use T = 30 time steps and a coupling strength of λ = 10. We train the model with standard GD (baseline) and with Goldstone-GD with k_1 = 50 and k_2 = 10. We find fastest convergence for the baseline method if we clip the gradients to an interval [−g, g] and use a decreasing learning rate ρ_s = ρ_0 (s̃/(s + s̃))^0.7 despite the noise-free gradient. Here, s is the training iteration. We optimize the hyperparameters for fastest convergence in the baseline and find g = 0.01, ρ_0 = 1, and s̃ = 100. (A minimal sketch of this schedule appears after the table.) |
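
The baseline training schedule quoted in the Experiment Setup row is compact enough to restate as code. The sketch below is an illustrative reading of that row only: the gradient callback `loss_grad`, the parameter initialization, and the toy quadratic in the usage line are hypothetical stand-ins rather than the paper's factorization model, and the decay-offset symbol s̃ is a reconstruction of a glyph lost in extraction.

```python
# Minimal sketch of the quoted baseline schedule: gradient clipping to
# [-g, g] and a decreasing learning rate rho_s = rho_0 * (s~/(s + s~))**0.7.
# `loss_grad` and the toy objective below are hypothetical, not from the paper.
import numpy as np

def clipped_gd(loss_grad, theta0, num_steps=1000, g=0.01, rho0=1.0, s_tilde=100.0):
    """Plain gradient descent with the quoted clipping and decay schedule."""
    theta = np.asarray(theta0, dtype=float).copy()
    for s in range(num_steps):
        grad = np.clip(loss_grad(theta), -g, g)           # clip gradients to [-g, g]
        rho_s = rho0 * (s_tilde / (s + s_tilde)) ** 0.7   # decreasing step size
        theta -= rho_s * grad
    return theta

# Toy usage on a 1-d quadratic loss 0.5 * (theta - 3)^2 (illustrative only).
theta_star = clipped_gd(lambda th: th - 3.0, theta0=[0.0])
```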
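
The 50% / 20% / 30% split quoted in the Dataset Splits row can be sketched in the same spirit. The helper below is an assumption-laden illustration: the paper states the split is random across all bins, but the bin indexing, the seed handling, and the name `split_indices` are not taken from the paper.

```python
# Illustrative random 50/20/30 split over entry (bin) indices, assuming a
# flat index 0..num_entries-1; the exact splitting code is not published.
import numpy as np

def split_indices(num_entries, seed=0):
    """Return (train, validation, test) index arrays in a 50/20/30 ratio."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_entries)
    n_train = int(0.5 * num_entries)
    n_valid = int(0.2 * num_entries)
    return (perm[:n_train],                    # 50% training
            perm[n_train:n_train + n_valid],   # 20% validation
            perm[n_train + n_valid:])          # remaining ~30% test

train_idx, valid_idx, test_idx = split_indices(num_entries=1000)
```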