FOSI: Hybrid First and Second Order Optimization

Authors: Hadar Sivan, Moshe Gabel, Assaf Schuster

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS)."
Researcher Affiliation | Academia | Hadar Sivan, Technion, Haifa, Israel (hadarsivan@cs.technion.ac.il); Moshe Gabel, York University, Toronto, Canada (mgabel@yorku.ca); Assaf Schuster, Technion, Haifa, Israel (assaf@cs.technion.ac.il)
Pseudocode | Yes | "The steps are summarized as Algorithm 1 in the Supplementary Material (Appendix A.1). ... Algorithm 2 provides the pseudocode for FOSI." (An illustrative sketch of this style of hybrid update appears after the table.)
Open Source Code | Yes | "An open source implementation of FOSI, available at: https://github.com/hsivan/fosi." (A hedged usage sketch appears after the table.)
Open Datasets | Yes | "Audio Classification (AC): Training MobileNetV1 (approximately 4 million parameters) on the AudioSet dataset (Gemmeke et al., 2017). ... Language Model (LM): Training an RNN-based character-level language model ... on the Tiny Shakespeare dataset (Karpathy, 2015). ... Autoencoder (AE): Training an autoencoder model ... on the CIFAR-10 dataset. ... Transfer Learning (TL): Transfer learning from ImageNet to CIFAR-10. ... Logistic Regression (LR): Training a multi-class logistic regression model to predict the 10 classes of the MNIST dataset."
Dataset Splits | No | The paper mentions using "standard datasets" and reports validation accuracy and validation loss, but it does not give train/validation/test split percentages or sample counts, describe the splitting methodology, or cite an external source for the splits used.
Hardware Specification | Yes | "For experiments, we use an NVIDIA A40 GPU."
Software Dependencies | Yes | "We implemented FOSI in Python using the JAX framework (Bradbury et al., 2018) 0.3.25."
Experiment Setup | Yes | "We execute FOSI with k = 10 and ℓ = 0... We set α = 0.01, c = 3, and W such that warmup is one epoch. T is determined... resulting in T = 800 for all experiments. ... We use the standard learning rate for Adam (0.001), and the best learning rate for HB out of 0.1, 0.01, 0.001, with default momentum parameters β1 = 0.9, β2 = 0.999 for Adam and β = 0.9 for HB." (A configuration sketch in optax appears after the table.)
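
The Pseudocode row points to Algorithms 1 and 2 in the paper's appendix, which are not reproduced in this report. As a rough illustration only, the sketch below shows the general shape of a hybrid first/second-order update: the gradient is split into the component lying in the span of k estimated Hessian eigenvectors, which gets a Newton-like eigenvalue-scaled step, and the residual, which gets a stand-in first-order step. This is not the paper's Algorithm 2; the names hybrid_step, V, lam, base_lr, and newton_scale are placeholders introduced here.

```python
import jax.numpy as jnp

def hybrid_step(params, grad, V, lam, base_lr=0.001, newton_scale=0.01):
    """Illustrative hybrid update. V is (d, k) estimated eigenvectors, lam is (k,) eigenvalues."""
    coeffs = V.T @ grad                  # gradient coordinates in the eigen-subspace
    g_subspace = V @ coeffs              # gradient component inside the subspace
    g_rest = grad - g_subspace           # residual handled by the first-order rule
    newton_update = V @ (coeffs / lam)   # eigenvalue-scaled (Newton-like) step
    return params - newton_scale * newton_update - base_lr * g_rest

# Tiny example with a diagonal "Hessian" whose top-2 eigenvectors are known.
d, k = 4, 2
V = jnp.eye(d)[:, :k]
lam = jnp.array([10.0, 5.0])
new_params = hybrid_step(jnp.ones(d), jnp.ones(d), V, lam)
```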
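The repository at https://github.com/hsivan/fosi is the authoritative source for how to call FOSI. The snippet below is only a guess at an optax-style wrapper: the entry point fosi_adam, its signature, and the toy problem around it are assumptions made for illustration, not details confirmed by this report.

```python
import jax
import jax.numpy as jnp
import optax
# Assumed entry point; check the repository README for the actual import.
from fosi import fosi_adam

# Toy problem so the sketch is self-contained.
params = jnp.zeros(3)
batch = jnp.array([1.0, 2.0, 3.0])
loss_fn = lambda p, b: jnp.sum((p - b) ** 2)

# Assumed signature: wrap a base optax optimizer with FOSI, passing the loss
# and a representative batch for its spectrum-estimation procedure.
optimizer = fosi_adam(optax.adam(0.001), loss_fn, batch)
opt_state = optimizer.init(params)

grads = jax.grad(loss_fn)(params, batch)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```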
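The Experiment Setup row lists the concrete hyperparameters. A minimal way to express them in code, assuming optax for the base optimizers (Heavy-Ball written as SGD with momentum) and a plain dict for the FOSI-specific values, since the exact wrapper signature is not given in this report:

```python
import optax

# FOSI-specific values quoted in the Experiment Setup row; W (warmup) is
# "one epoch", so its step count depends on the dataset and batch size.
fosi_config = {"k": 10, "ell": 0, "alpha": 0.01, "c": 3, "T": 800}

# Base optimizers as described: Adam with the standard learning rate 0.001
# and betas (0.9, 0.999); Heavy-Ball as SGD with momentum 0.9, with the
# learning rate tuned over {0.1, 0.01, 0.001} (0.01 shown as one candidate).
adam = optax.adam(learning_rate=0.001, b1=0.9, b2=0.999)
heavy_ball = optax.sgd(learning_rate=0.01, momentum=0.9)
```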