MARTHE: Scheduling the Learning Rate via Online Hypergradients

Authors: Michele Donini, Luca Franceschi, Orchid Majumder, Massimiliano Pontil, Paolo Frasconi

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed an extensive set of experiments in order to compare MARTHE, RTHO, and HD. We also considered a classic LR scheduling baseline in the form of exponential decay (Exponential) where the LR schedule is defined by η_t = η_0 γ^t. (A minimal sketch of this schedule appears after the table.)
Researcher Affiliation | Collaboration | Michele Donini (1), Luca Franceschi (2,3), Orchid Majumder (1), Massimiliano Pontil (2,3) and Paolo Frasconi (4). 1 Amazon; 2 University College London, London, UK; 3 Istituto Italiano di Tecnologia, Genova, Italy; 4 Università di Firenze, Firenze, Italy. donini@amazon.com; luca.franceschi@iit.it
Pseudocode | Yes | Algorithm 1 presents the pseudocode of MARTHE. (A hedged sketch of the simplest related hypergradient update appears after the table.)
Open Source Code | Yes | Finally, our PyTorch implementation of the methods and the experimental framework to reproduce the results is available at https://github.com/awslabs/adatune.
Open Datasets | Yes | We trained three-layer feedforward neural networks with 500 hidden units per layer on a subset of 7000 MNIST [LeCun et al., 1998] images.
Dataset Splits | Yes | We further sampled 700 images to form the validation set and defined E to be the validation loss after T = 512 optimization steps (about 7 epochs). (A sketch of such a split appears after the table.)
Hardware Specification | Yes | We performed all experiments using AWS P3.2XL instances, each providing one NVIDIA Tesla V100 GPU.
Software Dependencies | No | The paper mentions 'Our PyTorch implementation' but does not specify version numbers for PyTorch or any other software libraries used in the experiments.
Experiment Setup | Yes | We used a cross-entropy loss and SGD as optimization dynamics Φ, with a mini-batch size of 100. We initialized η = 0.01 · 1_512 for LRS-OPT and set η_0 = 0.01 for MARTHE. We used two alternative optimization dynamics: SGDM with the momentum hyperparameter fixed to 0.9 and Adam with the commonly suggested default values β_1 = 0.9 and β_2 = 0.999. We fixed the batch size to 128, the initial learning rate η_0 = 0.1 for SGDM and 0.003 for Adam, and the weight decay (i.e. ℓ2-norm) to 5·10⁻⁴. For the adaptive methods, we sampled β in [10⁻³, 10⁻⁶] log-uniformly, and for our method, we sampled µ between 0.9 and 0.999. Finally, we picked the decay factor γ for Exponential log-uniformly in [0.9, 1.0]. Gradients were clipped to an absolute value of 100.0. (An optimizer-setup sketch with these values appears after the table.)
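
The exponential-decay baseline quoted under Research Type (η_t = η_0 γ^t) corresponds to PyTorch's built-in ExponentialLR scheduler. Below is a minimal sketch with illustrative values η_0 = 0.1 and γ = 0.97; these particular numbers are assumptions, not values reported in the paper (which samples γ log-uniformly in [0.9, 1.0]).

    import torch

    # Toy model and optimizer; lr=0.1 and gamma=0.97 are illustrative values only.
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # ExponentialLR multiplies the LR by gamma at every scheduler.step(),
    # i.e. it realizes eta_t = eta_0 * gamma**t when stepped once per unit of t.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

    for t in range(5):
        # ... forward pass and loss.backward() would go here in real training ...
        optimizer.step()   # parameter update (a no-op here since no gradients exist)
        scheduler.step()   # decay the LR: eta_{t+1} = gamma * eta_t
        print(t, scheduler.get_last_lr())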
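
Algorithm 1 itself is not reproduced in the quote above, but the family of online hypergradient updates that MARTHE builds on (HD and RTHO) can be illustrated in its simplest, greedy form: the learning rate is moved against the hypergradient of the validation loss, which for plain SGD reduces to a dot product between the current validation gradient and the previous training gradient. The synthetic data, the hyper-learning-rate β = 1e-4, and the non-negativity clamp below are assumptions for illustration; this sketch is not the paper's Algorithm 1.

    import torch

    # Hypothetical synthetic regression data, for illustration only.
    torch.manual_seed(0)
    X_tr, y_tr = torch.randn(256, 10), torch.randn(256, 1)
    X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()

    eta = 0.01         # current learning rate (eta)
    beta = 1e-4        # hyper-learning-rate (beta) for the online LR update
    prev_grads = None  # training gradient from the previous step

    for step in range(100):
        # Training gradient at the current parameters w_t.
        model.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]

        # Greedy hypergradient: under SGD, w_t = w_{t-1} - eta * grad_L(w_{t-1}),
        # so dE/d(eta) ~= -grad_E(w_t) . grad_L(w_{t-1}); the LR moves against it.
        if prev_grads is not None:
            model.zero_grad()
            loss_fn(model(X_val), y_val).backward()
            val_grads = [p.grad.detach() for p in model.parameters()]
            hypergrad = -sum((v * g).sum() for v, g in zip(val_grads, prev_grads))
            eta = max(eta - beta * hypergrad.item(), 0.0)  # clamp is a safeguard, not from the paper

        # Plain SGD step with the (possibly updated) learning rate.
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= eta * g
        prev_grads = grads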
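
The MNIST subset and validation split quoted in the Open Datasets and Dataset Splits rows can be set up along these lines. The sketch assumes the 700 validation images are drawn disjointly from the 7000 training images and uses the mini-batch size of 100 quoted above; the exact sampling procedure in the released code may differ.

    import torch
    from torchvision import datasets, transforms

    # Draw a random subset of MNIST: 7000 training images and 700 validation images.
    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    idx = torch.randperm(len(mnist))
    train_set = torch.utils.data.Subset(mnist, idx[:7000].tolist())
    val_set = torch.utils.data.Subset(mnist, idx[7000:7700].tolist())

    train_loader = torch.utils.data.DataLoader(train_set, batch_size=100, shuffle=True)
    val_loader = torch.utils.data.DataLoader(val_set, batch_size=100)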
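
The optimizers and gradient clipping quoted in the Experiment Setup row map directly onto standard PyTorch calls. The model below is a hypothetical placeholder; only the hyperparameter values (momentum 0.9, initial LRs 0.1 and 0.003, Adam betas (0.9, 0.999), weight decay 5·10⁻⁴, clipping at 100.0) come from the quoted text. Element-wise value clipping is one way to read 'clipped to an absolute value of 100.0'; the released code may clip differently.

    import torch

    # Hypothetical placeholder model; the paper's experiments use larger networks.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))

    # SGDM: momentum 0.9, initial LR 0.1, weight decay 5e-4.
    opt_sgdm = torch.optim.SGD(model.parameters(), lr=0.1,
                               momentum=0.9, weight_decay=5e-4)

    # Adam: betas (0.9, 0.999), initial LR 0.003, same weight decay.
    opt_adam = torch.optim.Adam(model.parameters(), lr=0.003,
                                betas=(0.9, 0.999), weight_decay=5e-4)

    # After each backward pass, clip every gradient component to [-100, 100].
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=100.0)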