Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Gradient descent with generalized Newton’s method
Authors: Zhiqi Bu, Shiyun Xu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. |
| Researcher Affiliation | Collaboration | Zhiqi Bu EMAIL Shiyun Xu University of Pennsylvania EMAIL |
| Pseudocode | Yes | Algorithm 1 Generalized Newton's optimizers (GeN), e.g. γ = 0.9, Φ = 8 |
| Open Source Code | Yes | Equal contribution. Code available at https://github.com/ShiyunXu/AutoGeN. |
| Open Datasets | Yes | We train CIFAR10 (Krizhevsky et al., 2009) on ResNet 18, 34, 50, 152 (He et al., 2016) and ViT tiny, small, base and large (Dosovitskiy et al., 2020). For fine tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019). |
| Dataset Splits | Yes | We train CIFAR10 (Krizhevsky et al., 2009) on ResNet 18, 34, 50, 152 (He et al., 2016) and ViT tiny, small, base and large (Dosovitskiy et al., 2020)... CIFAR10 and CIFAR100 are standard tiny image datasets that we have used as the test-bed... We evaluate RoBERTa-base (Liu et al., 2019) on the GLUE (Wang et al., 2019) benchmark with LoRA, BitFit and full-parameter training (FT). |
| Hardware Specification | No | Our default setting is full-parameter training (including mixed precision training), Φ = 1, and on single GPU (no communication cost among devices). |
| Software Dependencies | No | For fine tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019). ... following the official Pytorch tutorial |
| Experiment Setup | Yes | Our default hyperparameters for Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 are: B = 500, Φ = 4, SGD learning rate=1e-2, AdamW learning rate=1e-4, unless one of the hyperparameters are varied for the ablation study. ... In Figure 1, Figure 2, Figure 9 and Table 3, we follow the codebase of Hu et al. and use B = 256, sequence length 128, η0 = 1e-3, and 5 epochs. While applying, we set Φ = 4. ... Per-task settings (batch size; initial learning rate for FT; # of epochs; eval metrics): MRPC: 128; 2e-5; 10; F1. SST2: 128; 1e-6; 10; acc. MNLI: 128; 1e-6; 5 (1 for FT); matched acc. & mismatched acc. CoLA: 128; 2e-5; 10; Matthews corr. QNLI: 128; 2e-5; 10; acc. QQP: 256; 2e-5; 5; F1. RTE: 128; 2e-5; 60; acc. |
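
For anyone trying to reproduce the GLUE runs, the per-task settings quoted in the Experiment Setup row could be written down as a small config. This is only a sketch: the names `GlueSetup`, `GLUE_SETUPS`, and `epochs_for` are hypothetical and do not come from the paper or its repository, and the values assume the flattened table reads as task, batch size, initial learning rate for FT, number of epochs, eval metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GlueSetup:
    """One row of the quoted per-task table (hypothetical container)."""
    batch_size: int
    init_lr_ft: float                 # initial learning rate for full-parameter training (FT)
    epochs: int                       # number of epochs as listed in the table
    metric: str                       # evaluation metric as quoted
    epochs_ft: Optional[int] = None   # override when the table lists a separate FT epoch count


# Values transcribed from the Experiment Setup row; the layout is an assumption.
GLUE_SETUPS = {
    "MRPC": GlueSetup(128, 2e-5, 10, "F1"),
    "SST2": GlueSetup(128, 1e-6, 10, "acc."),
    "MNLI": GlueSetup(128, 1e-6, 5, "matched acc. & mismatched acc.", epochs_ft=1),
    "CoLA": GlueSetup(128, 2e-5, 10, "Matthews corr."),
    "QNLI": GlueSetup(128, 2e-5, 10, "acc."),
    "QQP": GlueSetup(256, 2e-5, 5, "F1"),
    "RTE": GlueSetup(128, 2e-5, 60, "acc."),
}


def epochs_for(task: str, full_finetune: bool) -> int:
    """Return the epoch count for a task, using the FT override if one is listed."""
    setup = GLUE_SETUPS[task]
    if full_finetune and setup.epochs_ft is not None:
        return setup.epochs_ft
    return setup.epochs
```

For example, `epochs_for("MNLI", full_finetune=True)` picks up the "1 for FT" note, while every other task uses the same epoch count for all training modes.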