Non-asymptotic Global Convergence Analysis of BFGS with the Armijo-Wolfe Line Search
Authors: Qiujiang Jin, Ruichen Jiang, Aryan Mokhtari
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present the first explicit and non-asymptotic global convergence rates of the BFGS method when implemented with an inexact line search scheme satisfying the Armijo-Wolfe conditions. We show that BFGS achieves a global linear convergence rate of (1 − 1/κ)^t for µ-strongly convex functions with L-Lipschitz gradients, where κ = L/µ denotes the condition number. Additionally, if the objective function's Hessian is Lipschitz, BFGS with the Armijo-Wolfe line search achieves a linear convergence rate that depends solely on the line search parameters, independent of the condition number. We also establish a global superlinear convergence rate of O((1/t)^t). These global bounds hold for any starting point x0 and any symmetric positive definite initial Hessian approximation matrix B0, although the choice of B0 affects the number of iterations needed to attain these rates. By synthesizing these results, we outline the first global complexity characterization of BFGS with the Armijo-Wolfe line search. Additionally, we clearly define a mechanism for selecting the step size to satisfy the Armijo-Wolfe conditions and characterize its overall complexity. [...] 8 Numerical Experiments: We conduct numerical experiments on a cubic objective function defined as f(x) = α Σ_{i=1}^{n−1} g(v_i^⊤x − v_{i+1}^⊤x) − βv_1^⊤x + (λ/2)‖x‖² (29), where g : ℝ → ℝ is defined as g(w) = (1/3)|w|³ for |w| ≤ ∆ and g(w) = ∆w² − ∆²|w| + (1/3)∆³ for |w| > ∆ (30), where α, β, λ, ∆ are hyper-parameters and {v_i}_{i=1}^{n} are standard orthogonal unit vectors in ℝ^d. We focus on this objective function because it is used in [26] to establish a tight lower bound for second-order methods. We compare the convergence paths of BFGS with an inexact line search step size η_t that satisfies the Armijo-Wolfe conditions (5) and (6) for various initialization matrices B0: specifically, B0 = LI, B0 = µI, B0 = I, and B0 = cI, where c is defined in Remark 6.2; it is easily verified that c ∈ [µ, L]. We also compare the BFGS methods against the gradient descent (GD) method with backtracking line search, using α = 0.1 in condition (5) and β = 0.9 in condition (6). The step size η_t is chosen at each iteration via log bisection in Algorithm 1. Empirical results are compared across various dimensions d and condition numbers κ, with the x-axis showing the number of iterations t and the y-axis showing the ratio (f(x_t) − f(x*)) / (f(x_0) − f(x*)). (Code sketches of this objective, the step-size selection, and the BFGS loop appear after the table.) |
| Researcher Affiliation | Academia | Qiujiang Jin, ECE, UT Austin (qiujiangjin0@gmail.com); Ruichen Jiang, ECE, UT Austin (rjiang@utexas.edu); Aryan Mokhtari, ECE, UT Austin (mokhtari@austin.utexas.edu) |
| Pseudocode | Yes | Appendix J: "Log Bisection Algorithm for Weak Wolfe Conditions"; Algorithm 1: "Log Bisection Algorithm for Weak Wolfe Conditions" |
| Open Source Code | No | The paper does not explicitly state that open-source code for the methodology is provided, nor does it include a link to a code repository. |
| Open Datasets | No | The numerical experiments are conducted on a synthetic "cubic objective function" defined by equations (29) and (30), not a publicly available dataset. |
| Dataset Splits | No | The paper uses a synthetic objective function for numerical experiments and does not describe traditional training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the numerical experiments (e.g., CPU, GPU models, memory). |
| Software Dependencies | No | The paper mentions "BFGS method", "gradient descent (GD) method with backtracking line search", and "log bisection in Algorithm 1" but does not specify any software names with version numbers (e.g., Python, PyTorch, NumPy). |
| Experiment Setup | Yes | We compare the convergence paths of BFGS with an inexact line search step size η_t that satisfies the Armijo-Wolfe conditions (5) and (6) for various initialization matrices B0: specifically, B0 = LI, B0 = µI, B0 = I, and B0 = cI, where c is defined in Remark 6.2; it is easily verified that c ∈ [µ, L]. We also compare the BFGS methods against the gradient descent (GD) method with backtracking line search, using α = 0.1 in condition (5) and β = 0.9 in condition (6). The step size η_t is chosen at each iteration via log bisection in Algorithm 1. Empirical results are compared across various dimensions d and condition numbers κ, with the x-axis showing the number of iterations t and the y-axis showing the ratio (f(x_t) − f(x*)) / (f(x_0) − f(x*)). (See the BFGS and line-search sketches following this table.) |
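
Below is a minimal NumPy sketch of the cubic objective in Eqs. (29)–(30) as reconstructed above, assuming the form f(x) = α Σ g(v_i^⊤x − v_{i+1}^⊤x) − βv_1^⊤x + (λ/2)‖x‖²; the symbol names (`delta` for the cutoff in g, `alpha`, `beta`, `lam`) and the exact coefficient placement are assumptions, not details confirmed by the paper.

```python
import numpy as np

def g(w, delta):
    """Piecewise cubic/quadratic scalar function g: R -> R (Eq. 30, reconstructed)."""
    w = np.abs(w)
    return np.where(
        w <= delta,
        w**3 / 3.0,                                    # cubic branch, |w| <= delta
        delta * w**2 - delta**2 * w + delta**3 / 3.0,  # quadratic continuation, |w| > delta
    )

def cubic_objective(x, V, alpha, beta, lam, delta):
    """Objective f(x) of Eq. (29); V stacks the orthonormal vectors v_1, ..., v_n as rows."""
    proj = V @ x                   # coordinates v_i^T x
    diffs = proj[:-1] - proj[1:]   # v_i^T x - v_{i+1}^T x, i = 1, ..., n-1
    return (alpha * np.sum(g(diffs, delta))
            - beta * proj[0]
            + 0.5 * lam * np.dot(x, x))
```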
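The step-size rule quoted above requires the Armijo condition (5) with α = 0.1 and the curvature (weak Wolfe) condition (6) with β = 0.9, with η_t found by log bisection (Algorithm 1). The sketch below is a generic bracketing-and-bisection search for such a step size, not a transcription of the paper's Algorithm 1; the doubling rule and the geometric-mean (log-space) bisection are assumptions.

```python
import numpy as np

def armijo_wolfe_step(f, grad_f, x, d, alpha=0.1, beta=0.9, max_iter=50):
    """Return a step size eta satisfying
    f(x + eta*d) <= f(x) + alpha*eta*grad_f(x)^T d   (Armijo, Eq. 5) and
    grad_f(x + eta*d)^T d >= beta*grad_f(x)^T d      (weak Wolfe, Eq. 6)."""
    fx, gxd = f(x), grad_f(x) @ d
    lo, hi, eta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + eta * d) > fx + alpha * eta * gxd:   # Armijo fails: step too long
            hi = eta
        elif grad_f(x + eta * d) @ d < beta * gxd:    # curvature fails: step too short
            lo = eta
        else:
            return eta
        # expand the bracket while no upper bound exists, otherwise bisect in log space
        eta = 2.0 * lo if np.isinf(hi) else (np.sqrt(lo * hi) if lo > 0 else 0.5 * hi)
    return eta
```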
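Finally, a hedged sketch of the BFGS iteration used in the comparison: the standard BFGS update of the Hessian approximation B_t, with the initial matrix B0 set to LI, µI, I, or cI as in the experiments, and the suboptimality f(x_t) recorded for plotting the ratio (f(x_t) − f(x*)) / (f(x_0) − f(x*)). It reuses the hypothetical `armijo_wolfe_step` helper from the previous sketch.

```python
import numpy as np

def bfgs(f, grad_f, x0, B0, n_iter=100, alpha=0.1, beta=0.9):
    """BFGS with an Armijo-Wolfe step size; B0 is the initial Hessian approximation
    (e.g. B0 = L*I, mu*I, I, or c*I as in the paper's experiments)."""
    x, B = x0.copy(), B0.copy()
    history = [f(x)]
    for _ in range(n_iter):
        g = grad_f(x)
        d = -np.linalg.solve(B, g)              # search direction d_t = -B_t^{-1} grad f(x_t)
        eta = armijo_wolfe_step(f, grad_f, x, d, alpha, beta)
        x_new = x + eta * d
        s, y = x_new - x, grad_f(x_new) - g     # displacement and gradient difference
        if s @ y > 1e-12:                       # positive curvature keeps B positive definite
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)  # BFGS update
        x = x_new
        history.append(f(x))
    return x, history
```

To mirror the reported plots, one would also supply the gradient of the cubic objective, run this loop once per choice of B0 (plus GD with backtracking as a baseline), and plot each `history` normalized by f(x_0) − f(x*) against the iteration count t.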