MARTHE: Scheduling the Learning Rate via Online Hypergradients
Authors: Michele Donini, Luca Franceschi, Orchid Majumder, Massimiliano Pontil, Paolo Frasconi
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed an extensive set of experiments in order to compare MARTHE, RTHO, and HD. We also considered a classic LR scheduling baseline in the form of exponential decay (Exponential), where the LR schedule is defined by η_t = η_0·γ^t. (A minimal sketch of this baseline schedule appears below the table.) |
| Researcher Affiliation | Collaboration | Michele Donini¹, Luca Franceschi²,³, Orchid Majumder¹, Massimiliano Pontil²,³ and Paolo Frasconi⁴. ¹Amazon; ²University College London, London, UK; ³Istituto Italiano di Tecnologia, Genova, Italy; ⁴Università di Firenze, Firenze, Italy. donini@amazon.com; luca.franceschi@iit.it |
| Pseudocode | Yes | Algorithm 1 presents the pseudocode of MARTHE. |
| Open Source Code | Yes | Finally, our PyTorch implementation of the methods and the experimental framework to reproduce the results is available at https://github.com/awslabs/adatune. |
| Open Datasets | Yes | We trained three-layer feed-forward neural networks with 500 hidden units per layer on a subset of 7000 MNIST [LeCun et al., 1998] images. (A sketch combining this architecture with the MNIST training settings from the Experiment Setup row appears below the table.) |
| Dataset Splits | Yes | We further sampled 700 images to form the validation set and defined E to be the validation loss after T = 512 optimization steps (about 7 epochs). |
| Hardware Specification | Yes | We performed all experiments using AWS P3.2XL instances, each providing one NVIDIA Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions 'Our PyTorch implementation' but does not specify version numbers for PyTorch or any other software libraries used in the experiments. |
| Experiment Setup | Yes | We used a cross-entropy loss and SGD as optimization dynamics Φ, with a mini-batch size of 100. We initialized η = 0.01·1_512 for LRS-OPT and set η_0 = 0.01 for MARTHE. We used two alternative optimization dynamics: SGDM with the momentum hyperparameter fixed to 0.9, and Adam with the commonly suggested default values β_1 = 0.9 and β_2 = 0.999. We fixed the batch size to 128, the initial learning rate η_0 = 0.1 for SGDM and 0.003 for Adam, and the weight decay (i.e. ℓ2-norm) to 5·10^-4. For the adaptive methods, we sampled β in [10^-3, 10^-6] log-uniformly, and for our method, we sampled µ between 0.9 and 0.999. Finally, we picked the decay factor γ for Exponential log-uniformly in [0.9, 1.0]. Gradients were clipped to an absolute value of 100.0. (Hedged sketches of these training configurations appear below the table.) |
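
The Exponential baseline quoted in the Research Type row follows the schedule η_t = η_0·γ^t. The snippet below is a minimal sketch of that schedule using PyTorch's built-in `ExponentialLR`; it is not the authors' code, and the placeholder model, the value of η_0, and the particular γ are illustrative assumptions (the paper samples γ log-uniformly in [0.9, 1.0]).

```python
import torch

model = torch.nn.Linear(10, 2)                              # placeholder model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)     # eta_0 = 0.1 (assumed value)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # gamma assumed in [0.9, 1.0]

for t in range(10):
    # ... forward/backward pass omitted for brevity ...
    optimizer.step()
    scheduler.step()  # learning rate becomes eta_0 * gamma**(t + 1)
```
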
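The MNIST experiment quoted above uses a three-layer feed-forward network with 500 hidden units per layer, a cross-entropy loss, plain SGD with η_0 = 0.01, and mini-batches of 100. The following is a minimal sketch of that setup under those stated values; the random tensors stand in for a real MNIST mini-batch, and the 7000/700 train/validation subsampling is not reproduced here. It is not the authors' released code.

```python
import torch
import torch.nn as nn

# Three-layer feed-forward network, 500 hidden units per layer, 10 output classes.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # eta_0 = 0.01, plain SGD

# Placeholder tensors standing in for one MNIST mini-batch of size 100 (assumption).
x = torch.randn(100, 1, 28, 28)
y = torch.randint(0, 10, (100,))

loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
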
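The second configuration in the Experiment Setup row (SGDM with momentum 0.9 or Adam with default betas, batch size 128, initial learning rates of 0.1 and 0.003 respectively, weight decay 5·10^-4, gradient clipping at 100.0) can be sketched as below. The placeholder linear model and input shapes are assumptions for illustration only; this is not the authors' implementation.

```python
import torch

model = torch.nn.Linear(32 * 32 * 3, 100)  # placeholder model and dimensions (assumption)

# SGD with momentum 0.9, eta_0 = 0.1, weight decay 5e-4.
sgdm = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Adam with the default betas (0.9, 0.999), eta_0 = 0.003, weight decay 5e-4.
adam = torch.optim.Adam(model.parameters(), lr=0.003, betas=(0.9, 0.999), weight_decay=5e-4)

# Placeholder mini-batch of size 128 (assumption).
x = torch.randn(128, 32 * 32 * 3)
y = torch.randint(0, 100, (128,))
loss = torch.nn.functional.cross_entropy(model(x), y)

sgdm.zero_grad()
loss.backward()
# "Gradients were clipped to an absolute value of 100.0"
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=100.0)
sgdm.step()  # or adam.step() when using the Adam dynamics
```
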