On the Unstable Convergence Regime of Gradient Descent

Authors: Shuo Chen, Jiaying Peng, Xiaolong Li, Yao Zhao

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a theoretical perspective that offers a qualitative analysis of this phenomenon. The unstable convergence is in fact an inherent property of GD for general twice-differentiable functions. Specifically, the forward-invariance of GD is established, i.e., it ensures that any point within a local region will always remain within this region under GD iteration. Then, based on the forward-invariance, for an initialization outside an open set containing the local minimum, the loss function will oscillate during the first several iterations and then become monotonically decreasing after the GD trajectory jumps into the open set. This work theoretically clarifies the unstable convergence phenomenon of GD discussed in previous experimental works. The unstable convergence of GD mainly depends on the selection of the initialization, and it is actually inevitable due to the complex nature of the loss function. [...] Numerical Experiments: In Theorem 2 and Theorem 3, we provide theoretical discussions concerning the forward-invariance, loss reduction, and convergence of GD. Especially near the minimum point, a distinct trend of monotonically decreasing loss becomes prominent. The first row of Figure 2 (six figures) shows an example for the following function of two variables: f1(x, y) = x^2 + y^2 + x^2 y + x^3 + y^3. [...] In Figure 3, both the GD trajectory and the loss changes are displayed for the two types of initializations. (A minimal gradient-descent sketch on f1 illustrating this behavior is given after this table.)
Researcher Affiliation | Academia | Shuo Chen (1), Jiaying Peng (2), Xiaolong Li (1*), Yao Zhao (1); (1) Institute of Information Science, Beijing Jiaotong University, Beijing, 100044, China; (2) School of Mathematical Sciences, Capital Normal University, Beijing, 100048, China; schen1307@foxmail.com, jiayingpeng.math@foxmail.com, lixl@bjtu.edu.cn, yzhao@bjtu.edu.cn
Pseudocode | No | The paper contains mathematical derivations and theorems but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statements about releasing source code or links to a code repository.
Open Datasets | No | The paper conducts numerical experiments using synthetically defined functions (e.g., f1(x, y) = x^2 + y^2 + x^2 y + x^3 + y^3), not external datasets that require public access information.
Dataset Splits | No | The paper performs numerical experiments on synthetically defined mathematical functions. It does not utilize traditional training, validation, or test dataset splits.
Hardware Specification | No | The paper does not mention any specific hardware used for running the numerical experiments (e.g., CPU or GPU models, cloud resources).
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or solvers).
Experiment Setup | Yes | Specifically, the set of stable initializations Uη is depicted in cyan, while the set of unstable initializations, i.e., Vη \ Uη, is shown in blue. When the LR is chosen to be relatively small, GD performed from every point in the graph yields monotonically decreasing function values, i.e., stable convergence. As the LR increases, unstable convergence occurs, and the set of initializations leading to unstable convergence expands. [...] The third and fourth rows of Figure 2 demonstrate the other two functions, f3 and f4, with more complex sets of initializations for stable and unstable convergence, where f3(x, y) = x^4 y^2 + x^2 y^4 - 3 x^2 y^2 + 1 and f4(x, y) = (1 - xy)^2 + (1/20)(x^2 + y^2). [...] f3 has four global minima at (±1, ±1) with the same λmax = 12 (and thus, the maximum LR is 1/6). For f4, it has two global minima at ±(√(19/20), √(19/20)) with the same λmax = 19/5 (and thus, the maximum LR is 10/19). [...] In Figure 3, both the GD trajectory and the loss changes are displayed for the two types of initializations. The irregular loss oscillations seen in the previous section ultimately dissipate with iterations. The trajectory transitions from an unstable area to a stable one, finally converging to a local minimum. The figure provides a visual representation of this convergence process, where the red line signifies instability and the black line signifies stability. By comprehensively analyzing these convergence types along with the loss curves, the paper gains insight into the behavior of the optimization process for the chosen functions, showing how the learning rate affects convergence stability and illustrating the complexities inherent in non-convex loss landscapes.
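The oscillate-then-decrease behavior described in the Research Type excerpt is easy to probe numerically on f1. Below is a minimal sketch (not the authors' code), assuming NumPy; the learning rate 0.9 and the initialization (0.2, 0.2) are illustrative choices, not the paper's Figure 2 settings, picked so that the step size sits just below the local stability limit 2/λmax = 1 at the minimum (0, 0) while the starting point lies in a higher-curvature region.

```python
# Minimal sketch (not the authors' code): plain gradient descent on
# f1(x, y) = x^2 + y^2 + x^2*y + x^3 + y^3 from the excerpt above,
# tracking the loss to observe an early rise followed by monotone decrease.
import numpy as np

def f1(p):
    x, y = p
    return x**2 + y**2 + x**2 * y + x**3 + y**3

def grad_f1(p):
    x, y = p
    return np.array([2*x + 2*x*y + 3*x**2,   # df/dx
                     2*y + x**2 + 3*y**2])   # df/dy

def gd(p0, lr, steps):
    """Run fixed-step gradient descent and record the loss at every iterate."""
    p = np.array(p0, dtype=float)
    losses = [f1(p)]
    for _ in range(steps):
        p -= lr * grad_f1(p)
        losses.append(f1(p))
    return p, np.array(losses)

# Illustrative settings (assumed, not the paper's configuration):
# lr = 0.9 is just below the stability limit 2 / lambda_max = 1 at (0, 0),
# and (0.2, 0.2) lies where the Hessian's largest eigenvalue exceeds 2 / lr.
p_final, losses = gd(p0=(0.2, 0.2), lr=0.9, steps=50)
rises = np.where(np.diff(losses) > 0)[0]
print("iterations where the loss increased:", rises)
print("final iterate:", p_final, "final loss:", losses[-1])
```

With these values the loss rises on the first step and then decreases monotonically once the iterate enters the neighborhood of (0, 0), which is the qualitative pattern the quoted excerpt attributes to initializations outside the forward-invariant open set.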
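The λmax and maximum-LR values quoted in the Experiment Setup row can be cross-checked from the stated formulas for f3 and f4. The following is a sketch under those formulas only, not the paper's code: the finite-difference `hessian` helper and the choice of one representative minimum per function are assumptions introduced here, and the learning-rate threshold is taken to be 2/λmax.

```python
# Cross-check (assumed setup, not from the paper): largest Hessian eigenvalue
# at a global minimum of f3 and f4, and the stability threshold 2 / lambda_max.
import numpy as np

def hessian(f, p, eps=1e-4):
    """Central-difference Hessian of a scalar function f at point p."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(p + e_i + e_j) - f(p + e_i - e_j)
                       - f(p - e_i + e_j) + f(p - e_i - e_j)) / (4 * eps**2)
    return H

f3 = lambda p: p[0]**4 * p[1]**2 + p[0]**2 * p[1]**4 - 3 * p[0]**2 * p[1]**2 + 1
f4 = lambda p: (1 - p[0] * p[1])**2 + (p[0]**2 + p[1]**2) / 20

# One representative minimum per function (derived from the formulas above).
for name, f, minimum in [("f3", f3, (1.0, 1.0)),
                         ("f4", f4, (np.sqrt(19/20), np.sqrt(19/20)))]:
    lam_max = np.linalg.eigvalsh(hessian(f, minimum)).max()
    print(f"{name}: lambda_max = {lam_max:.4f}, max LR = {2 / lam_max:.4f}")
# Expected from the quoted text: f3 -> 12 and 1/6; f4 -> 19/5 and 10/19.
```

Running this reproduces λmax = 12 (max LR 1/6) for f3 and λmax = 19/5 (max LR 10/19) for f4, matching the numbers quoted in the Experiment Setup row.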