Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AdaFisher: Adaptive Second Order Optimization via Fisher Information

Authors: Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code is available from https://github.com/AtlasAnalyticsLab/AdaFisher. ... To evaluate AdaFisher, we conduct experiments on six benchmark datasets across Image Classification for Computer Vision (CV) and Language Modeling for Natural Language Processing (NLP) that are commonly used to evaluate optimization algorithms: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Tiny ImageNet (Le & Yang, 2015), and ImageNet-1k (Deng et al., 2009) for image classification; Wikitext-2 (Merity et al., 2017) and Penn Treebank (PTB) (Marcus et al., 1993) for language modeling. The six baseline methods we compare with are SGD, Adam/AdamW, K-FAC, AdaHessian, and Shampoo. For CIFAR experiments, we report the average over five runs. We also perform a transfer learning task using the ImageNet-1k weights from Paszke et al. (2019). Detailed descriptions of the experimental setup (including HP tuning, datasets, and data augmentation), results, and analyses are provided in Appendix D.
Researcher Affiliation | Academia | Damien Martins Gomes (Concordia University and IPSA Toulouse), Yanlei Zhang (Université de Montréal and Mila), Eugene Belilovsky (Concordia University and Mila), Guy Wolf (Université de Montréal and Mila), Mahdi S. Hosseini (Concordia University and Mila)
Pseudocode | Yes | The implementation for both AdaFisher variants is delineated in the pseudo-code presented in Algorithm 1. Algorithm 1: AdaFisher optimization algorithm.
Open Source Code | Yes | Code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
Open Datasets | Yes | To evaluate AdaFisher, we conduct experiments on six benchmark datasets across Image Classification for Computer Vision (CV) and Language Modeling for Natural Language Processing (NLP) that are commonly used to evaluate optimization algorithms: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Tiny ImageNet (Le & Yang, 2015), and ImageNet-1k (Deng et al., 2009) for image classification; Wikitext-2 (Merity et al., 2017) and Penn Treebank (PTB) (Marcus et al., 1993) for language modeling. ... Specifically, we calculate the true Fisher using the NNgeometry Python package (George, 2021), which facilitates the computation of the FIM, Gauss-Newton Matrix, or Neural Tangent Kernels applied to neural networks. ... on a subset of the MNIST dataset (Deng, 2012) over 50 epochs.
Dataset Splits | Yes | D.2.2 DATASET DETAILS CIFAR. The training/test sets for the CIFAR-10/100 datasets contain 50k/10k images, respectively. ... Tiny ImageNet. The training/test sets for Tiny ImageNet (Le & Yang, 2015) contain 100k/10k images. ... ImageNet-1k. The training/test sets for ImageNet-1k (Russakovsky et al., 2015) contain 1,281,167/150k images. ... D.3.1 DATASET DETAILS The Wikitext-2 dataset, derived from high-quality Wikipedia articles, contains over two million words and is structured into training, validation, and test sets.
Hardware Specification | Yes | D.1 HARDWARE In total, we had a server with 6 NVIDIA RTX 6000 Ada Generation GPUs with 48 gigabytes of VRAM and 128 gigabytes of RAM available for all experiments. All experiments described in this report were conducted on a system equipped with a single NVIDIA RTX 6000 Ada Generation GPU and 64 gigabytes of RAM, except for training AdaFisher on ImageNet-1k with batch sizes of 512 and 1024, where four GPUs were utilized.
Software Dependencies | No | The paper mentions using PyTorch and the ASDL library but does not provide specific version numbers for these or other software components. For example: "For the Shampoo and K-FAC optimizers, we utilized the ASDL library as implemented in PyTorch provided by Osawa et al. (2023)." and "...we calculate the true Fisher using the NNgeometry Python package (George, 2021)..."
Experiment Setup | Yes | D.2.1 HP TUNING Effective HP tuning is crucial for optimizing the performance of deep learning models. In this study, we systematically explored various HPs for both CNNs and ViTs across multiple image classification tasks. The following subsections detail the tuning strategies employed for each model architecture and dataset. CNNs. For all image classification tasks involving CNNs, we utilized ResNet18 as the backbone architecture and evaluated its performance on the CIFAR-10 dataset with a fixed batch size of 256 trained for 50 epochs. The HP tuning process encompassed the following components: Optimizer Selection and Learning Rate Tuning: Each optimizer was fine-tuned using ResNet18 on CIFAR-10. We performed a grid search to identify the optimal learning rate from the set {0.0001, 0.0003, 0.0005, 0.0009, ..., 0.1, 0.3, 0.5, 0.9}. Learning Rate Scheduling: A cosine annealing learning rate decay strategy was employed, aligning with the number of training epochs specified for each optimizer in Table 8. ... Weight Decay: We applied a uniform weight decay of 5 × 10⁻⁴ across all optimizers for CIFAR-10 and Tiny ImageNet. An exception was made for MobileNetV3, where the weight decay was set to 1 × 10⁻⁵. For experiments on ImageNet-1k, the weight decay was established at 1 × 10⁻⁴. Damping Parameter Tuning: AdaFisher, K-FAC, and Shampoo: * K-FAC and AdaFisher: The damping parameter was searched within {0.0001, 0.0003, 0.0005, 0.0009, 0.001, 0.003, 0.005, 0.009, 0.01, 0.03, 0.05, 0.09}. ... AdaFisher Decay Factors: The decay factor γ for AdaFisher was tuned within {0.1, 0.2, ..., 0.9, 0.99}. The optimal value is γ = 0.8.
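The HP-tuning protocol quoted in the Experiment Setup row (grid search over learning rate and damping with a fixed backbone and budget) can be sketched as follows. The abbreviated grids and the `validation_accuracy` stub are hypothetical placeholders, not the paper's full search space or training loop:

```python
import itertools

# Hypothetical, abbreviated grids for illustration only; the paper's actual
# grids are larger (see the quoted Appendix D.2.1 above).
learning_rates = [0.0001, 0.001, 0.01, 0.1]
dampings = [0.0001, 0.001, 0.01]

def validation_accuracy(lr, damping):
    """Placeholder objective. A real run would train ResNet18 on CIFAR-10
    (batch size 256, 50 epochs, cosine-annealed lr) and return val accuracy."""
    return -abs(lr - 0.001) - abs(damping - 0.001)  # toy surrogate

# Exhaustive grid search: evaluate every (lr, damping) pair, keep the best.
best_lr, best_damping = max(
    itertools.product(learning_rates, dampings),
    key=lambda hp: validation_accuracy(*hp),
)
```

With the toy surrogate, the search returns the grid point closest to (0.001, 0.001); in a real run the surrogate would be replaced by actual training.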
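The Pseudocode row above refers to the paper's Algorithm 1, which is Kronecker-factored and not reproduced here. As a minimal sketch of the general idea only (preconditioning a momentum step with a damped, exponentially averaged Fisher estimate), here is a diagonal empirical-Fisher variant; the `gamma` and `damping` names mirror the hyperparameters tuned in Appendix D, but the update itself is a generic illustration, not the authors' method:

```python
import numpy as np

def diag_fisher_step(w, grad, state, lr=1e-3, beta1=0.9, gamma=0.8, damping=1e-3):
    """One illustrative optimizer step: an EMA of squared gradients serves as
    a crude diagonal empirical-Fisher estimate that preconditions a momentum
    step. Generic sketch only; AdaFisher uses Kronecker-factored curvature."""
    m = state.get("m", np.zeros_like(w))   # first-moment (momentum) buffer
    F = state.get("F", np.zeros_like(w))   # diagonal Fisher-proxy buffer
    m = beta1 * m + (1 - beta1) * grad
    F = gamma * F + (1 - gamma) * grad ** 2
    state["m"], state["F"] = m, F
    return w - lr * m / (np.sqrt(F) + damping)  # damped preconditioned update
```

On a toy quadratic loss f(w) = w², repeated calls with `grad = 2 * w` drive `w` toward the minimum at 0.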
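The Software Dependencies row flags missing version pins for PyTorch, ASDL, and NNgeometry. A standard way to make such an environment reproducible (general pip tooling, not something the paper provides) is to freeze the installed versions:

```shell
# Capture the exact versions of every installed package into a pin file.
pip freeze > requirements.txt

# A reader can later recreate the same environment with:
# pip install -r requirements.txt
```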