Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Improving the Straight-Through Estimator with Zeroth-Order Information

Authors: Ningfeng Yang, Tor Aamodt

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. Experiments across a wide variety of benchmarks suggest that FOGZO can outperform the STE, while requiring less computation than n-SPSA. We first compare the STE, n-SPSA, and FOGZO with experiments on shallow networks in Section 4.1, as running n-SPSA with large n is infeasible on deeper networks. In Section 4.2, we compare the STE and FOGZO on deeper networks with n = 1. Lastly, in Section 4.3, we compare the evaluation quality of FOGZO against the STE under the same training time.
Researcher Affiliation | Academia | Ningfeng Yang, University of British Columbia, EMAIL; Tor M. Aamodt, University of British Columbia, EMAIL
Pseudocode | Yes | Algorithm 1 FOGZO Gradient Descent
Open Source Code | Yes | Code is available at https://github.com/1733116199/fogzo.
Open Datasets | Yes | We train MNIST on a 2-layer MLP (784-10-10) with the identity STE... For datasets, we use Imagenet-1K (41), Imagenet-100 (7), and C4 (39).
Dataset Splits | Yes | We train MNIST on a 2-layer MLP (784-10-10)... For datasets, we use Imagenet-1K (41), Imagenet-100 (7), and C4 (39)... We adapt the training recipe from (18) for Resnets using the source code from (37), the training recipe from (48) for DeiTs, and the training recipe from (53) for LLaMAs.
Hardware Specification | Yes | All models with fewer than 30 million parameters are trained on Nvidia RTX 2080 Ti GPUs with 11 GB of memory. All models with parameter counts between 30 million and 200 million are trained on Nvidia RTX 5090 with 32 GB of RAM. All models with more than 200 million parameters are trained on Nvidia A100 80GB. We use LSQ-STE and LSQ-FOGZO to train a LLaMA model with 30M non-embedding parameters on one Nvidia RTX 5090 GPU and report the training time and C4 evaluation perplexity in Table 4.
Software Dependencies | No | The paper mentions 'AdamW optimizer' and 'Cosine Annealing scheduler' and implicitly refers to 'Pytorch' through code references like 'pytorch/examples', but does not provide specific version numbers for these software components or any other libraries.
Experiment Setup | Yes | We use a batch size of 512 and a learning rate of 2e-3 * batch_size / 32. We train for 10 epochs on MNIST. We use the AdamW optimizer and the Cosine Annealing scheduler. For FOGZO, we use β ∈ {0.666, 0.9, 0.999, 1}. We use linearly scheduled βs with the hyperparameter βmin. We follow the training recipes and hyperparameters of (18), (48), (53), (35). We make no changes to the training recipes with a few exceptions for simplicity: we do not use EMA evaluation; we do not use gradient accumulation and simply follow the linear scaling rule (16) to scale batch size and learning rate; we keep BatchNorm in training mode but freeze the running stats when we measure the loss of the perturbed model.
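
The learning-rate rule quoted in the Experiment Setup row can be worked through numerically. The following is a minimal sketch, not the authors' code: the function names, `total_steps`, and the floor `min_lr=0.0` are assumptions made for illustration. The source gives only the base rate 2e-3, the reference batch size 32, the batch size 512, and the name of the Cosine Annealing scheduler.

```python
import math


def scaled_learning_rate(batch_size, base_lr=2e-3, base_batch_size=32):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def cosine_annealed_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Standard closed-form cosine annealing from peak_lr (step 0) to min_lr (last step).

    min_lr=0.0 is an assumed floor; the source does not state one.
    """
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Peak learning rate for the quoted batch size of 512: 2e-3 * 512 / 32 = 0.032
peak = scaled_learning_rate(512)

# total_steps=100 is a hypothetical schedule length chosen only for this example.
schedule = [cosine_annealed_lr(s, 100, peak) for s in (0, 50, 100)]
```

Under these assumptions the rate starts at the peak, passes through roughly half the peak at the midpoint, and decays to the floor at the final step.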
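
The Research Type row notes that n-SPSA with large n is infeasible on deeper networks. A generic sketch of an averaged two-sided SPSA gradient estimate makes the cost visible: each of the n samples needs two extra loss evaluations. This is a textbook formulation written for illustration, not the paper's Algorithm 1 and not FOGZO itself; the function name `spsa_gradient`, the Rademacher (±1) perturbations, and the step size `eps` are assumptions.

```python
import random


def spsa_gradient(loss_fn, params, n=1, eps=1e-3, rng=None):
    """Average n two-sided SPSA estimates of the gradient of loss_fn at params.

    Costs 2 * n loss evaluations, regardless of the number of parameters.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(params)
    for _ in range(n):
        # Assumed perturbation distribution: independent +/-1 per coordinate.
        z = [rng.choice((-1.0, 1.0)) for _ in params]
        plus = [p + eps * zi for p, zi in zip(params, z)]
        minus = [p - eps * zi for p, zi in zip(params, z)]
        # Directional finite difference along z.
        slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += slope * zi / n
    return grad


def quadratic(x):
    """Toy loss with known gradient 2 * x, used only to exercise the estimator."""
    return sum(v * v for v in x)


one_dim = spsa_gradient(quadratic, [2.0], n=1)
two_dim = spsa_gradient(quadratic, [1.0, -2.0], n=4000)
```

In one dimension a single sample already recovers the derivative of the toy loss, while in higher dimensions the estimate is noisy and only approaches the true gradient as n grows, which is why large n becomes expensive when each loss evaluation is a full forward pass.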