Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds
Authors: Daniel Dodd, Louis Sharrock, Christopher Nemeth
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach is validated through numerical experiments, demonstrating competitive performance against learning-rate-dependent algorithms. We assess the numerical performance of RDoG (Algorithm 1), RDoWG (Algorithm 2), and NRDoG against manually tuned RSGD (Bonnabel, 2013) and RADAM (Bécigneul & Ganea, 2019). |
| Researcher Affiliation | Academia | Daniel Dodd 1 Louis Sharrock 1 Christopher Nemeth 1 1Department of Mathematics and Statistics, Lancaster University, UK. |
| Pseudocode | Yes | Algorithm 1 RDoG; Algorithm 2 RDoWG; Algorithm 3 T-RDoWG; Algorithm 4 in Appendix F. (An illustrative sketch of the distance-over-gradients step behind Algorithm 1 follows the table.) |
| Open Source Code | Yes | Code to reproduce the experiments is available at https://github.com/daniel-dodd/riemannian_dog. |
| Open Datasets | Yes | We consider datasets Wine, Waveform-5000, and Tiny ImageNet. The WordNet noun hierarchy (Miller et al., 1990) is a lexical database of English words organized into a hierarchical structure. |
| Dataset Splits | No | The paper states 'Each dataset has an 80:20 train-test split per replication.' but does not specify a separate validation split. (An illustrative split call follows the table.) |
| Hardware Specification | Yes | Implementing all algorithms in Python 3 with JAX (Bradbury et al., 2018), our experiments run on a MacBook Pro 16 (2021) with an Apple M1 Pro chip and 16GB of RAM. |
| Software Dependencies | No | The paper states 'Implementing all algorithms in Python 3 with JAX', but no specific version number is given for JAX, and other software such as scikit-learn is mentioned without a version. |
| Experiment Setup | Yes | We employ RADAM and RSGD with a grid of twenty logarithmically spaced learning rates η ∈ [10^-8, 10^6]. On the other hand, we investigate RDoG and RDoWG with ten logarithmically spaced initial distance values ϵ ∈ [10^-8, 10^0]. In training, Wine uses the full batch for T = 5000 iterations, and Waveform-5000 and Tiny ImageNet use batch sizes of 64 for T = 2000 iterations. For initialization, following Nickel & Kiela (2017), we uniformly initialize the embeddings in [-10^-3, 10^-3]^d and consider ten logarithmically spaced learning rates η ∈ [10^-2, 10^2] and five logarithmically spaced initial distance estimates ϵ ∈ [10^-10, 10^-6]. In the first ten epochs, we use RSGD with a reduced learning rate of η/10 for RSGD and RADAM. Thereafter, we run the optimizers on the initialized embeddings for one thousand epochs, with each iteration having a batch size of ten and fifty uniformly sampled negative samples. We repeat this experiment over five replications. |
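
The hyperparameter grids quoted in the experiment-setup row are straightforward to reconstruct. The snippet below is a minimal sketch assuming plain logarithmic spacings over the stated ranges; the variable names are illustrative and not taken from the authors' repository.

```python
import jax.numpy as jnp

# Twenty log-spaced learning rates in [1e-8, 1e6] for RSGD / RADAM.
rsgd_radam_learning_rates = jnp.logspace(-8, 6, num=20)

# Ten log-spaced initial distance estimates in [1e-8, 1e0] for RDoG / RDoWG.
rdog_rdowg_initial_distances = jnp.logspace(-8, 0, num=10)
```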
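
For context on the pseudocode row, the following is a minimal Euclidean sketch of the distance-over-gradients step-size rule that RDoG (Algorithm 1) adapts to manifolds; the paper's algorithm replaces the Euclidean norm and subtraction below with the Riemannian distance and the exponential map. All names are illustrative assumptions, not code from the repository.

```python
import jax.numpy as jnp

def dog_style_step(x, x0, grad, max_dist, grad_sq_sum):
    """One learning-rate-free, DoG-style update (Euclidean sketch).

    `max_dist` is seeded with the initial distance estimate epsilon, the only
    hyperparameter the experiments sweep for RDoG / RDoWG.
    """
    # Track the largest distance travelled from the initial iterate.
    max_dist = jnp.maximum(max_dist, jnp.linalg.norm(x - x0))
    # Accumulate squared gradient norms.
    grad_sq_sum = grad_sq_sum + jnp.sum(grad ** 2)
    # Step size = distance over (square root of summed) gradients; no tuned learning rate.
    step_size = max_dist / jnp.sqrt(grad_sq_sum)
    return x - step_size * grad, max_dist, grad_sq_sum
```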
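
The dataset-splits row quotes an 80:20 train-test split per replication. A minimal illustration with scikit-learn (mentioned in the software-dependencies row) is given below; the placeholder data and seeding are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data shaped like the Wine dataset (178 samples, 13 features, 3 classes).
X = np.random.rand(178, 13)
y = np.random.randint(0, 3, size=178)

for replication in range(5):
    # 80:20 train-test split, re-drawn for each of the five replications.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=replication
    )
```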