Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits
Authors: Abdurakhmon Sadiev, Peter Richtarik, Ilyas Fatkhullin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results are primarily theoretical and validated through synthetic experiments with controllable heavy-tailed noise. While one component of our algorithm (in its unclipped form) has previously been applied in reinforcement learning settings [Fatkhullin et al., 2023], its broader practical deployment remains an open challenge. In particular, several questions must be addressed to make these methods viable in real-world applications: How can we reduce hyperparameter tuning overhead? Are there default configurations that are robust across problem classes? Is clipping truly necessary, or can it be systematically replaced by normalization techniques, as recent work suggests in the first-order context [Hübler et al., 2025]? We view our contributions as a foundational step towards a principled understanding of second-order methods under realistic, heavy-tailed noise conditions, and hope they spur further theoretical and algorithmic advances in this direction. |
| Researcher Affiliation | Academia | Abdurakhmon Sadiev KAUST Center of Excellence for Generative AI Peter Richtárik KAUST Center of Excellence for Generative AI Ilyas Fatkhullin ETH Zurich, ETH AI Center, Georgia Institute of Technology |
| Pseudocode | Yes | Algorithm 1 NSGDHess (Normalized SGD with Hessian correction) 1: Input: Starting point x0 Rd, a vector g0 Rd, a stepsize γ > 0, momentum parameters α > 0, the number of iterations T. 2: x1 = x0 γ g0 g0 3: for t = 1, 2, . . . , T 1 do 4: Sample qt U ([0, 1]) 5: ˆxt = qtxt + (1 qt)xt 1 6: Sample ξt, ˆξt D independently 7: gt = (1 α) gt 1 + 2f(ˆxt, ˆξt)(xt xt 1) + α f(xt, ξt) 8: xt+1 = xt γ gt gt 9: end for 10: Output: |
| Open Source Code | No | Our work s focus is advancing theory of stochastic optimization, so this question is not applicable. Our one numerical illustration is a very toy experiment involving 10 dimensional quadratic problem with synthetic heavy-tailed noise. |
| Open Datasets | No | Figure 2: Performance of algorithms on a simple problem, F(x) = 0.5 x 2, d = 10 with synthetic noise generated from a two-sided Pareto distribution with tail index p = 1.1. We observe that algorithms without clipping, NSGDM and NSGDHess, suffer significantly from noise. This motivates our more in-depth study involving gradient and Hessian clipping for high probability convergence. |
| Dataset Splits | No | Figure 2: Performance of algorithms on a simple problem, F(x) = 0.5 x 2, d = 10 with synthetic noise generated from a two-sided Pareto distribution with tail index p = 1.1. We observe that algorithms without clipping, NSGDM and NSGDHess, suffer significantly from noise. This motivates our more in-depth study involving gradient and Hessian clipping for high probability convergence. |
| Hardware Specification | No | The illustration is toy and can be run on any device without advanced compute resources. |
| Software Dependencies | No | The text does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | 1. Effect of Clipping Level on Iteration Complexity. We run Algorithm 2 (Clip NSGDHess) with λh = λ until it reached the target accuracy F(x) ≤ 3/2. The stepsize and momentum parameters are set to γ = 0.01 and α = 0.2. Clipping levels are varied from 10−16 to 103 in multiplicative steps of 10. In Figure 3, we observe that intermediate clipping (λ = 10) yields the lowest iteration complexity, also refer to Table 2 for precise numerical values. Extremely small or large values lead to slower convergence. The proposed algorithm tolerates very small values of λ, but very large values (e.g., λ > 103) result in very slow convergence. This empirically confirms the need for Hessian clipping to stabilize convergence and achieve fast convergence. ... 2. Comparison with Clip NSGDM under Varying Tail Index. We compare Clip NSGDHess with Clip NSGDM (which was proposed in Cutkosky and Mehta [2021]). For Clip NSGDM, we used the theoretical parameter choices: stepsize γ = T^(2p−1)/(3p−2) and momentum α = T^(−(p−1)/(3p−2)). For both methods we fixed T = 4000, λ = 0.5, and λh = 0.05. These experiments were conducted to complement Figure 1, which presents the theoretical iteration complexities. As shown in Figure 4, the empirical results align with the theory: Clip NSGDHess consistently outperforms Clip NSGDM, and the performance gap becomes larger as p increases. This provides strong empirical support for our theoretical findings. |