Generalized-Smooth Nonconvex Optimization is As Efficient As Smooth Nonconvex Optimization

Authors: Ziyi Chen, Yi Zhou, Yingbin Liang, Zhaosong Lu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "7. Experiments: In this section, we test our algorithms via solving the nonconvex phase retrieval problem in (10). We first compare deterministic algorithms with fine-tuned learning rate γ over 500 iterations. This includes the basic GD with γ = 8 × 10^-4, clipped GD (Zhang et al., 2019) with γ = 0.9 and normalization term max(‖∇f(x_t)‖, 100), and our β-GD with β = 1/3, 2/3, 1 and γ = 0.03, 0.1, 0.2, respectively. Figure 1 (top left) plots the comparison result on objective function value vs. iteration." (a sketch of these GD variants follows the table)
Researcher Affiliation | Academia | ¹Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, UT, US; ²Department of Electrical and Computer Engineering, Ohio State University, Columbus, OH, US; ³Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN, US.
Pseudocode | Yes | Algorithm 1 (β-Normalized GD); Algorithm 2 (SPIDER; Fang et al., 2018). (a sketch of the SPIDER estimator follows the table)
Open Source Code | Yes | The code can be downloaded from https://github.com/changy12/Generalized-Smooth-Nonconvex-Optimization-is-As-Efficient-As-Smooth-Nonconvex-Optimization
Open Datasets | Yes | "We preprocess the data by filling the missing values with the median of the corresponding variables, censorizing and standardizing all the variables, removing two categorical variables (country and status), and adding standard Gaussian noise to the target to ensure model robustness. We select the first 2000 samples {x_i, y_i}_{i=1}^{2000} as the training samples, where x_i ∈ R^34 and y_i ∈ R are feature and target respectively. In the DRO problem (12), we set λ = 0.01 and select ψ*(t) = (1/4)(t + 2)^2 + 1, which corresponds to the χ² divergence. For any sample pair (x_ξ, y_ξ), we adopt the regularized mean square loss function ℓ_ξ(w) = (1/2)(y_ξ − x_ξ^T w)^2 + 0.1 Σ_{j=1}^{34} ln(1 + |w^(j)|) with parameter w = [w^(1); . . . ; w^(34)] ∈ R^34. We initialize η_0 = 0.1 and randomly initialize w_0 ∈ R^34 entrywise using the standard Gaussian distribution." (Life Expectancy data: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download) (a sketch of this loss follows the table)
Dataset Splits | No | The paper states, "We select the first 2000 samples {x_i, y_i}_{i=1}^{2000} as the training samples," but does not provide explicit training, validation, and test splits (e.g., percentages or counts for each subset) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as CPU/GPU models, memory, or cloud computing instance types.
Software Dependencies | No | The paper mentions that the code is available for download but does not specify the versions of any programming languages, libraries, or other software dependencies used for the experiments. For example, while the code release suggests that Python may be used, no version of Python or of any deep learning framework (such as TensorFlow or PyTorch) is given.
Experiment Setup | Yes | "We first compare deterministic algorithms with fine-tuned learning rate γ over 500 iterations. This includes the basic GD with γ = 8 × 10^-4, clipped GD (Zhang et al., 2019) with γ = 0.9 and normalization term max(‖∇f(x_t)‖, 100), and our β-GD with β = 1/3, 2/3, 1 and γ = 0.03, 0.1, 0.2, respectively. ... We further compare stochastic algorithms with fine-tuned learning rate γ and fixed batch size b = 50 over 500 iterations. This includes the basic SGD with γ = 2 × 10^-4, normalized SGD with γ = 2 × 10^-3, normalized SGD with momentum (Jin et al., 2021) with γ = 3 × 10^-3 and momentum coefficient 10^-4, clipped SGD (Zhang et al., 2019) with γ = 0.3 and normalization term max(‖∇f(z_t)‖, 10^3), and SPIDER with γ = 0.01, epoch size q = 5, and batch sizes B = 3000 and b = 50." (a sketch of these stochastic updates follows the table)
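The sketches below illustrate the update rules and loss quoted in the table; they are hedged reconstructions for readability, not code from the authors' repository. First, the deterministic comparison (Research Type and Experiment Setup rows): a minimal NumPy sketch assuming the β-normalized GD step x_{t+1} = x_t − γ ∇f(x_t)/‖∇f(x_t)‖^β from Algorithm 1 and the clipping form implied by the normalization term max(‖∇f(x_t)‖, 100). The function names and the generic grad_f oracle are illustrative.

```python
import numpy as np

def gd_step(x, grad_f, gamma=8e-4):
    """Basic gradient descent: x <- x - gamma * grad f(x)."""
    return x - gamma * grad_f(x)

def clipped_gd_step(x, grad_f, gamma=0.9, threshold=100.0):
    """Clipped GD (Zhang et al., 2019): divide the step by max(||grad f(x)||, threshold)."""
    g = grad_f(x)
    return x - gamma * g / max(np.linalg.norm(g), threshold)

def beta_gd_step(x, grad_f, gamma, beta):
    """beta-normalized GD (Algorithm 1, as we read it): x <- x - gamma * g / ||g||**beta,
    so beta = 1 recovers standard normalized GD."""
    g = grad_f(x)
    # The small constant is our numerical safeguard, not part of the algorithm.
    return x - gamma * g / (np.linalg.norm(g) ** beta + 1e-12)
```

With β ∈ {1/3, 2/3, 1} and γ ∈ {0.03, 0.1, 0.2}, these calls match the hyperparameter pairs quoted in the table.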
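Algorithm 2 in the paper is the SPIDER estimator of Fang et al. (2018). The sketch below shows only the recursive gradient estimator, which is what the Pseudocode row refers to; the epoch length and batch sizes follow the quoted experiment, and grad_fn is a placeholder mini-batch gradient oracle.

```python
def spider_estimator(grad_fn, x_curr, x_prev, v_prev, t, big_batch, small_batch, q=5):
    """SPIDER gradient estimator (Fang et al., 2018).

    grad_fn(x, batch) is assumed to return the mini-batch gradient at x.
    Every q iterations the estimate is refreshed on a large batch (B = 3000 in the
    quoted setup); in between, a recursive small-batch correction (b = 50) is applied:
        v_t = grad_b(x_t) - grad_b(x_{t-1}) + v_{t-1}.
    """
    if t % q == 0:
        return grad_fn(x_curr, big_batch)
    # The same small batch must be evaluated at both x_t and x_{t-1}.
    return grad_fn(x_curr, small_batch) - grad_fn(x_prev, small_batch) + v_prev
```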
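The per-sample loss quoted in the Open Datasets row can be transcribed directly. The sketch below is our NumPy reading of ℓ_ξ(w) = (1/2)(y_ξ − x_ξ^T w)^2 + 0.1 Σ_j ln(1 + |w^(j)|), together with its gradient; the function names are ours, and the preprocessing steps (median imputation, standardization, Gaussian target noise) are not reproduced here.

```python
import numpy as np

def sample_loss(w, x, y, reg=0.1):
    """Regularized mean square loss from the DRO experiment:
    0.5 * (y - x^T w)^2 + reg * sum_j ln(1 + |w_j|)."""
    residual = y - x @ w
    return 0.5 * residual**2 + reg * np.sum(np.log1p(np.abs(w)))

def sample_grad(w, x, y, reg=0.1):
    """Gradient of sample_loss; the regularizer contributes sign(w_j) / (1 + |w_j|) per coordinate."""
    residual = y - x @ w
    return -residual * x + reg * np.sign(w) / (1.0 + np.abs(w))
```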
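Finally, the stochastic comparison in the Experiment Setup row. The sketch writes out the quoted step sizes; the exact momentum recursion of Jin et al. (2021) is not quoted in the row, so the exponential-average form below, with the quoted coefficient 10^-4 used as the averaging weight, is an assumption. Here g denotes a mini-batch stochastic gradient of batch size b = 50.

```python
import numpy as np

def sgd_step(x, g, gamma=2e-4):
    """Basic SGD with the quoted learning rate."""
    return x - gamma * g

def normalized_sgd_step(x, g, gamma=2e-3):
    """Normalized SGD: step along g / ||g||."""
    return x - gamma * g / np.linalg.norm(g)

def normalized_sgd_momentum_step(x, m, g, gamma=3e-3, alpha=1e-4):
    """Normalized SGD with momentum (cf. Jin et al., 2021); the exponential
    average m is our assumed form of the momentum recursion."""
    m = (1.0 - alpha) * m + alpha * g
    return x - gamma * m / np.linalg.norm(m), m

def clipped_sgd_step(x, g, gamma=0.3, threshold=1e3):
    """Clipped SGD (Zhang et al., 2019) with the quoted normalization max(||g||, 10^3)."""
    return x - gamma * g / max(np.linalg.norm(g), threshold)
```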