Better Parameter-Free Stochastic Optimization with ODE Updates for Coin-Betting
Authors: Keyi Chen, John Langford, Francesco Orabona (pp. 6239-6247)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that this new parameter-free algorithm outperforms algorithms with the best default learning rates and almost matches the performance of finely tuned baselines without anything to tune. (Sec. 4, Empirical Evaluation) Here, we compare CODE with SGD, SGD with truncated models (aProx) (Asi and Duchi 2019), SGD with Importance Weight Aware updates (IWA) (Karampatziakis and Langford 2011), AdaGrad (Duchi, Hazan, and Singer 2011), Adam (Kingma and Ba 2015), the coin-betting algorithm in (2) (Coin) (Orabona and Pal 2016), and the recursive coin-betting algorithm (Recursive) (Cutkosky and Sarlos 2019). We test the ability of CODE to get a good generalization error. Hence, we perform experiments with 21 different machine learning binary classification datasets and 17 regression datasets from the LIBSVM website (Chang and Lin 2011) and OpenML (Vanschoren et al. 2013). |
| Researcher Affiliation | Collaboration | Keyi Chen¹, John Langford², Francesco Orabona¹; ¹ Boston University, Boston, MA; ² Microsoft Research, New York, NY; keyichen@bu.edu, jcl@microsoft.com, francesco@orabona.com |
| Pseudocode | Yes | Algorithm 1: Coin-betting ODE (CODE). 1: Initialize: Wealth_0 = 1, H_1 = 1, θ_1 = 0 ∈ R^d. 2: for t = 1, ..., T do. 3: Query point x_t = (Wealth_t / H_t) θ_t. 4: Receive g_t such that E[g_t] ∈ ∂F(x_t), ‖g_t‖ ≤ 1. 5: Calculate h_t = min(1, h̃_t), where h̃_t is the zero of the function φ in (7). 6: Update Wealth_{t+1} = Wealth_t · exp(−⟨g_t, θ_t⟩ ln(1 + h_t/H_t) + ‖g_t‖² (h_t + H_t ln(H_t/(H_t + h_t)))). 7: Update H_{t+1} = H_t + h_t. 8: Update θ_{t+1} = θ_t − h_t g_t. 9: end for. (A hedged Python sketch of this loop follows the table.) |
| Open Source Code | No | The paper does not provide a direct link to open-source code for the described methodology, nor does it explicitly state that the code is released or available in supplementary materials. |
| Open Datasets | Yes | We pre-process the samples normalizing them to unit norm vectors. We shuffle the data and use 70% for training, 15% for validation, and hold out 15% for testing. We perform experiments with 21 different machine learning binary classification datasets and 17 regression datasets from the LIBSVM website (Chang and Lin 2011) and OpenML (Vanschoren et al. 2013). |
| Dataset Splits | Yes | We shuffle the data and use 70% for training, 15% for validation, and hold out 15% for testing. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | For SGD, aProx, and IWA, we use the optimal worst-case step size for stochastic convex optimization, η_k = η_0/√k, and tune the initial step size η_0. In the adaptive learning rate methods, AdaGrad and Adam, we tune the initial step size η_0. For each repetition and dataset, we use the validation set to select the best learning rate, train using that learning rate, test on the test set, and report the average of normalized loss. (A hedged sketch of this protocol follows the table.) |
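
The pseudocode row above can be read as a simple update loop. Below is a minimal Python sketch of Algorithm 1 (CODE) as quoted there, not the authors' released implementation (the paper provides none). The gradient oracle `grad_oracle` and the root-finder `zero_of_phi` for the function φ in the paper's equation (7) are hypothetical placeholders supplied by the caller; equation (7) is not reproduced in this summary, so how h_t is actually computed is left abstract here.

```python
import numpy as np


def code_updates(grad_oracle, zero_of_phi, dim, T):
    """Minimal sketch of Algorithm 1 (CODE), following the pseudocode row above.

    grad_oracle(x)   -- hypothetical: returns a stochastic (sub)gradient g_t at x
                        with E[g_t] in the subdifferential of F and ||g_t|| <= 1.
    zero_of_phi(...) -- hypothetical: returns the zero of the scalar function phi
                        from the paper's equation (7), which is not quoted here.
    """
    wealth = 1.0             # Wealth_0 = 1
    H = 1.0                  # H_1 = 1
    theta = np.zeros(dim)    # theta_1 = 0 in R^d

    for _ in range(T):
        x = wealth * theta / H                             # x_t = (Wealth_t / H_t) * theta_t
        g = grad_oracle(x)                                 # stochastic subgradient, ||g|| <= 1
        h = min(1.0, zero_of_phi(g, theta, wealth, H))     # h_t = min(1, zero of phi in (7))

        # Wealth_{t+1} = Wealth_t * exp(-<g_t, theta_t> ln(1 + h_t/H_t)
        #                               + ||g_t||^2 (h_t + H_t ln(H_t / (H_t + h_t))))
        wealth *= np.exp(-np.dot(g, theta) * np.log1p(h / H)
                         + np.dot(g, g) * (h + H * np.log(H / (H + h))))
        H += h                                             # H_{t+1} = H_t + h_t
        theta = theta - h * g                              # theta_{t+1} = theta_t - h_t * g_t

    return wealth * theta / H                              # final query point
```

Apart from computing h_t, every step of the quoted loop is in closed form, which is why the sketch only abstracts away the root-finding for φ.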
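To make the evaluation protocol in the "Open Datasets", "Dataset Splits", and "Experiment Setup" rows concrete, here is a hedged sketch of the stated data handling and baseline step-size schedule. The function names (`split_unit_norm`, `sgd_step_size`) are hypothetical, since no code accompanies the paper; the sketch only encodes what the quotes say: unit-norm normalization, a shuffled 70/15/15 train/validation/test split, and η_k = η_0/√k for the SGD-style baselines with η_0 selected on the validation set.

```python
import numpy as np


def split_unit_norm(X, y, rng):
    """Normalize samples to unit norm, shuffle, and split 70/15/15 as described."""
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    idx = rng.permutation(len(y))
    X, y = X[idx], y[idx]
    n_tr = int(0.70 * len(y))
    n_va = int(0.15 * len(y))
    train = (X[:n_tr], y[:n_tr])
    valid = (X[n_tr:n_tr + n_va], y[n_tr:n_tr + n_va])
    test = (X[n_tr + n_va:], y[n_tr + n_va:])
    return train, valid, test


def sgd_step_size(eta0, k):
    """Worst-case optimal schedule for stochastic convex optimization: eta_k = eta_0 / sqrt(k)."""
    return eta0 / np.sqrt(k)


# Per the quoted setup, eta_0 is chosen on the validation split (per repetition
# and dataset), the model is retrained with that eta_0, and the normalized loss
# is reported on the held-out test split.
```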