Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds

Authors: Daniel Dodd, Louis Sharrock, Christopher Nemeth

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach is validated through numerical experiments, demonstrating competitive performance against learning-rate-dependent algorithms. We assess the numerical performance of RDoG (Algorithm 1), RDoWG (Algorithm 2), and NRDoG against manually tuned RSGD (Bonnabel, 2013) and RADAM (Becigneul & Ganea, 2019).
Researcher Affiliation | Academia | Daniel Dodd, Louis Sharrock, Christopher Nemeth; Department of Mathematics and Statistics, Lancaster University, UK.
Pseudocode | Yes | Algorithm 1 RDoG; Algorithm 2 RDoWG; Algorithm 3 T-RDoWG; Algorithm 4 in Appendix F. (A minimal RDoG-style update is sketched after the table.)
Open Source Code | Yes | Code to reproduce the experiments is available at https://github.com/daniel-dodd/riemannian_dog.
Open Datasets | Yes | We consider the Wine, Waveform-5000, and Tiny ImageNet datasets. The WordNet noun hierarchy (Miller et al., 1990) is a lexical database of English words organized into a hierarchical structure.
Dataset Splits | No | The paper mentions 'Each dataset has an 80:20 train-test split per replication.' but does not specify a separate validation split. (An illustrative split is sketched after the table.)
Hardware Specification | Yes | Implementing all algorithms in Python 3 with JAX (Bradbury et al., 2018), our experiments run on a MacBook Pro 16 (2021) with an Apple M1 Pro chip and 16 GB of RAM.
Software Dependencies | No | The paper states 'Implementing all algorithms in Python 3 with JAX'. While Python 3 specifies a major version, no JAX version is given, and other software such as scikit-learn is mentioned without a version.
Experiment Setup | Yes | We employ RADAM and RSGD with a grid of twenty logarithmically spaced learning rates η ∈ [10^-8, 10^6], and investigate RDoG and RDoWG with ten logarithmically spaced initial distance values ϵ ∈ [10^-8, 10^0]. In training, Wine uses the full batch for T = 5000 iterations, while Waveform-5000 and Tiny ImageNet use batch sizes of 64 for T = 2000 iterations. For the embedding experiment, following Nickel & Kiela (2017), we initialize the embeddings uniformly in [-10^-3, 10^-3]^d and consider ten logarithmically spaced learning rates η ∈ [10^-2, 10^2] and five logarithmically spaced initial distance estimates ϵ ∈ [10^-10, 10^-6]. In the first ten epochs, we use a reduced learning rate of η/10 for RSGD and RADAM. Thereafter, we run the optimizers on the initialized embeddings for one thousand epochs, with each iteration using a batch size of ten and fifty uniformly sampled negative samples. We repeat this experiment over five replications. (The grids and initialization are sketched after the table.)
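
For concreteness, below is a minimal JAX sketch of an RDoG-style step in the spirit of Algorithm 1, specialized to the unit sphere. This is not the authors' implementation: the helper functions and the exact step-size bookkeeping are assumptions based on the Distance-over-Gradients rule, in which the step size at iteration t is the largest distance travelled from the initial point divided by the square root of the accumulated squared gradient norms.

```python
# A minimal sketch of an RDoG-style step on the unit sphere, written in JAX.
# NOT the authors' implementation: helper names and the step-size bookkeeping
# are assumptions based on the Distance-over-Gradients rule described above.
import jax.numpy as jnp


def riemannian_grad(x, egrad):
    """Project a Euclidean gradient onto the tangent space of the sphere at x."""
    return egrad - jnp.dot(egrad, x) * x


def sphere_exp(x, v):
    """Exponential map at x applied to a tangent vector v on the unit sphere."""
    norm_v = jnp.linalg.norm(v)
    safe = jnp.where(norm_v > 1e-12, norm_v, 1.0)
    moved = jnp.cos(norm_v) * x + jnp.sin(norm_v) * (v / safe)
    return jnp.where(norm_v > 1e-12, moved, x)


def sphere_dist(x, y):
    """Geodesic distance between two points on the unit sphere."""
    return jnp.arccos(jnp.clip(jnp.dot(x, y), -1.0, 1.0))


def rdog_init(x0, eps=1e-6):
    """State = (initial point, running max distance from x0, sum of squared gradient norms).
    Here eps plays the role of the initial distance estimate swept over in the experiments."""
    return (x0, jnp.asarray(eps), jnp.asarray(0.0))


def rdog_step(x, state, egrad):
    """One learning-rate-free step: eta_t = max distance travelled / sqrt(sum ||g||^2)."""
    x0, max_dist, grad_sq = state
    g = riemannian_grad(x, egrad)
    grad_sq = grad_sq + jnp.sum(g ** 2)
    max_dist = jnp.maximum(max_dist, sphere_dist(x, x0))
    eta = max_dist / jnp.sqrt(grad_sq + 1e-12)
    x_new = sphere_exp(x, -eta * g)
    x_new = x_new / jnp.linalg.norm(x_new)  # re-normalize to guard against numerical drift
    return x_new, (x0, max_dist, grad_sq)
```

Under this reading, RDoWG would replace the plain gradient accumulator with a distance-weighted one; the repository linked above remains the reference implementation.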
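
The 80:20 train-test split per replication noted above could be reproduced along the following lines with scikit-learn, which the dependencies row mentions. The synthetic data and the per-replication seeding are illustrative placeholders; the paper does not specify how replications are seeded.

```python
# An illustrative 80:20 train-test split per replication using scikit-learn.
# The synthetic arrays and seeding scheme are placeholders, not from the paper.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(178, 13))       # placeholder features (Wine-sized: 178 samples, 13 features)
y = rng.integers(0, 3, size=178)     # placeholder labels (3 classes)

for replication in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=replication
    )
    # ... fit the optimizer on (X_train, y_train), evaluate on (X_test, y_test)
```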
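
Likewise, the hyperparameter grids and embedding initialization quoted in the setup row could be generated as follows. The grid endpoints mirror the ranges above; the variable names, embedding dimension, and node count are illustrative.

```python
# Hyperparameter grids and embedding initialization matching the ranges quoted
# in the setup row. Variable names, embedding dimension d, and node count are
# illustrative choices, not taken from the paper.
import jax
import jax.numpy as jnp

# Classification experiments: learning rates for RSGD/RADAM, initial distances for RDoG/RDoWG.
lr_grid = jnp.logspace(-8, 6, num=20)     # twenty values in [1e-8, 1e6]
eps_grid = jnp.logspace(-8, 0, num=10)    # ten values in [1e-8, 1e0]

# Embedding experiment: grids and uniform initialization in [-1e-3, 1e-3]^d.
embed_lr_grid = jnp.logspace(-2, 2, num=10)     # ten values in [1e-2, 1e2]
embed_eps_grid = jnp.logspace(-10, -6, num=5)   # five values in [1e-10, 1e-6]

d, num_nodes = 10, 1000   # illustrative embedding dimension and node count
key = jax.random.PRNGKey(0)
embeddings = jax.random.uniform(key, (num_nodes, d), minval=-1e-3, maxval=1e-3)
```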