MARTHE: Scheduling the Learning Rate via Online Hypergradients

Authors: Michele Donini, Luca Franceschi, Orchid Majumder, Massimiliano Pontil, Paolo Frasconi

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed an extensive set of experiments in order to compare MARTHE, RTHO, and HD. We also considered a classic LR scheduling baseline in the form of exponential decay (Exponential) where the LR schedule is defined by η_t = η_0 γ^t. (A minimal sketch of this schedule appears after the table.)
Researcher Affiliation | Collaboration | Michele Donini (1), Luca Franceschi (2,3), Orchid Majumder (1), Massimiliano Pontil (2,3) and Paolo Frasconi (4). 1 Amazon; 2 University College London, London, UK; 3 Istituto Italiano di Tecnologia, Genova, Italy; 4 Università di Firenze, Firenze, Italy. donini@amazon.com; luca.franceschi@iit.it
Pseudocode | Yes | Algorithm 1 presents the pseudocode of MARTHE. (A hedged sketch of the simplest related hypergradient update appears after the table.)
Open Source Code | Yes | Finally, our PyTorch implementation of the methods and the experimental framework to reproduce the results is available at https://github.com/awslabs/adatune.
Open Datasets | Yes | We trained three-layer feedforward neural networks with 500 hidden units per layer on a subset of 7000 MNIST [LeCun et al., 1998] images.
Dataset Splits | Yes | We further sampled 700 images to form the validation set and defined E to be the validation loss after T = 512 optimization steps (about 7 epochs). (A sketch of such a split appears after the table.)
Hardware Specification | Yes | We performed all experiments using AWS P3.2XL instances, each providing one NVIDIA Tesla V100 GPU.
Software Dependencies | No | The paper mentions 'Our PyTorch implementation' but does not specify version numbers for PyTorch or any other software libraries used in the experiments.
Experiment Setup | Yes | We used a cross-entropy loss and SGD as optimization dynamics Φ, with a mini-batch size of 100. We initialized η = 0.01 · 1_512 for LRS-OPT and set η_0 = 0.01 for MARTHE. We used two alternative optimization dynamics: SGDM with the momentum hyperparameter fixed to 0.9 and Adam with the commonly suggested default values β_1 = 0.9 and β_2 = 0.999. We fixed the batch size to 128, the initial learning rate η_0 = 0.1 for SGDM and 0.003 for Adam, and the weight decay (i.e. ℓ2-norm) to 5·10⁻⁴. For the adaptive methods, we sampled β in [10⁻³, 10⁻⁶] log-uniformly, and for our method, we sampled µ between 0.9 and 0.999. Finally, we picked the decay factor γ for Exponential log-uniformly in [0.9, 1.0]. Gradients were clipped to an absolute value of 100.0. (An optimizer-setup sketch with these values appears after the table.)
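
The exponential-decay baseline quoted under Research Type (η_t = η_0 γ^t) corresponds to PyTorch's built-in ExponentialLR scheduler. Below is a minimal sketch with illustrative values η_0 = 0.1 and γ = 0.97; these particular numbers are assumptions, not values reported in the paper (which samples γ log-uniformly in [0.9, 1.0]).

    import torch

    # Toy model and optimizer; lr=0.1 and gamma=0.97 are illustrative values only.
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # ExponentialLR multiplies the LR by gamma at every scheduler.step(),
    # i.e. it realizes eta_t = eta_0 * gamma**t when stepped once per unit of t.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

    for t in range(5):
        # ... forward pass and loss.backward() would go here in real training ...
        optimizer.step()   # parameter update (a no-op here since no gradients exist)
        scheduler.step()   # decay the LR: eta_{t+1} = gamma * eta_t
        print(t, scheduler.get_last_lr())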
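
Algorithm 1 itself is not reproduced in the quote above, but the family of online hypergradient updates that MARTHE builds on (HD and RTHO) can be illustrated in its simplest, greedy form: the learning rate is moved against the hypergradient of the validation loss, which for plain SGD reduces to a dot product between the current validation gradient and the previous training gradient. The synthetic data, the hyper-learning-rate β = 1e-4, and the non-negativity clamp below are assumptions for illustration; this sketch is not the paper's Algorithm 1.

    import torch

    # Hypothetical synthetic regression data, for illustration only.
    torch.manual_seed(0)
    X_tr, y_tr = torch.randn(256, 10), torch.randn(256, 1)
    X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()

    eta = 0.01         # current learning rate (eta)
    beta = 1e-4        # hyper-learning-rate (beta) for the online LR update
    prev_grads = None  # training gradient from the previous step

    for step in range(100):
        # Training gradient at the current parameters w_t.
        model.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]

        # Greedy hypergradient: under SGD, w_t = w_{t-1} - eta * grad_L(w_{t-1}),
        # so dE/d(eta) ~= -grad_E(w_t) . grad_L(w_{t-1}); the LR moves against it.
        if prev_grads is not None:
            model.zero_grad()
            loss_fn(model(X_val), y_val).backward()
            val_grads = [p.grad.detach() for p in model.parameters()]
            hypergrad = -sum((v * g).sum() for v, g in zip(val_grads, prev_grads))
            eta = max(eta - beta * hypergrad.item(), 0.0)  # clamp is a safeguard, not from the paper

        # Plain SGD step with the (possibly updated) learning rate.
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= eta * g
        prev_grads = grads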
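
The MNIST subset and validation split quoted in the Open Datasets and Dataset Splits rows can be set up along these lines. The sketch assumes the 700 validation images are drawn disjointly from the 7000 training images and uses the mini-batch size of 100 quoted above; the exact sampling procedure in the released code may differ.

    import torch
    from torchvision import datasets, transforms

    # Draw a random subset of MNIST: 7000 training images and 700 validation images.
    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    idx = torch.randperm(len(mnist))
    train_set = torch.utils.data.Subset(mnist, idx[:7000].tolist())
    val_set = torch.utils.data.Subset(mnist, idx[7000:7700].tolist())

    train_loader = torch.utils.data.DataLoader(train_set, batch_size=100, shuffle=True)
    val_loader = torch.utils.data.DataLoader(val_set, batch_size=100)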
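
The optimizers and gradient clipping quoted in the Experiment Setup row map directly onto standard PyTorch calls. The model below is a hypothetical placeholder; only the hyperparameter values (momentum 0.9, initial LRs 0.1 and 0.003, Adam betas (0.9, 0.999), weight decay 5·10⁻⁴, clipping at 100.0) come from the quoted text. Element-wise value clipping is one way to read 'clipped to an absolute value of 100.0'; the released code may clip differently.

    import torch

    # Hypothetical placeholder model; the paper's experiments use larger networks.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))

    # SGDM: momentum 0.9, initial LR 0.1, weight decay 5e-4.
    opt_sgdm = torch.optim.SGD(model.parameters(), lr=0.1,
                               momentum=0.9, weight_decay=5e-4)

    # Adam: betas (0.9, 0.999), initial LR 0.003, same weight decay.
    opt_adam = torch.optim.Adam(model.parameters(), lr=0.003,
                                betas=(0.9, 0.999), weight_decay=5e-4)

    # After each backward pass, clip every gradient component to [-100, 100].
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=100.0)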