Tighter Analysis for ProxSkip
Authors: Zhengmian Hu, Heng Huang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our main motivation comes from the continuous limit, in which the original analysis of ProxSkip fails to guarantee convergence when both the step size γ and frequency p tend to zero. We construct a counterexample to demonstrate why such counterintuitive behavior occurs for the original analysis, and then propose a novel Lyapunov function variant to construct a tighter analysis that avoids the problem of the old one. The new Lyapunov function can be directly extended to many other variants of ProxSkip. When applied to the stochastic gradient setup, our analysis leads to an improved proximal operator complexity for SProxSkip from O(√(1/(εµ²)) log(1/ε)) to O(√κ log(1/ε)). |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Maryland, College Park, MD, USA. 2Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA. |
| Pseudocode | Yes | Algorithm 1: ProxSkip |
| Open Source Code | No | This paper only includes basic numerical experimental results, spread across different sections, to verify and illustrate our theoretical analysis results. This is for two reasons: firstly, our main contribution is to derive a tighter theoretical analysis and the algorithms are not new; secondly, ProxSkip and the other methods studied in this paper are abstract frameworks with multiple instantiations suitable for different scenarios. Interested readers should refer to the papers where these algorithms were proposed for evaluation and comparison with other algorithms. |
| Open Datasets | No | The paper describes a "Toy Model Setup" in Appendix A.5 with synthetic functions: f(x) = µ(κ−1) (Σ_{i=1}^{d−1} (x^(i) − x^(i+1))² + (x^(d))²) and ψ(x) = 0 if x^(1) = 0, +∞ otherwise. These are function definitions rather than a pre-existing or provided public dataset. |
| Dataset Splits | No | The paper mentions experiments and plots, but does not provide specific details on training, validation, and test splits for the data used in these experiments. For example, it refers to "10 random samples of the numerical process" but gives no split percentages or counts. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running its experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | We consider Nesterov's worst function in the world (Nesterov, 1998) in its strongly convex version as f(x) and an indicator function as ψ(x) (see Appendix A.5 for the details). The following definitions are used for producing Figures 2, 3 and 6: f(x) = µ(κ−1) (Σ_{i=1}^{d−1} (x^(i) − x^(i+1))² + (x^(d))²), ψ(x) = 0 if x^(1) = 0 and +∞ otherwise, with d = 10, κ = 10, µ = 0.1. The global optimum is x* = 0. The initial point is x0 = [1, 0, 0, . . . ] and h0 = 0. When SProxSkip is used, we set g_t(x_t) = ∇f(x_t) + 0.1·e where e ∼ N(0, I). |
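The quoted toy setup lends itself to a short simulation. Below is a minimal sketch of the ProxSkip iteration (the prox-skipping scheme of Mishchenko et al. that the paper analyzes) on a quadratic chain of the shape quoted above; the prox of the indicator ψ is simply the projection that zeroes the first coordinate. The exact scaling of f and the values of `gamma`, `p`, and `T` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Constants d, kappa, mu are quoted from the paper's toy setup; the step
# size gamma, prox probability p, and iteration count T are assumptions.
d, kappa, mu = 10, 10, 0.1
gamma, p, T = 0.1, 0.2, 5000

# Quadratic chain f(x) = c * (sum_{i<d} (x_i - x_{i+1})^2 + x_d^2),
# written as f(x) = c * x^T M x with a tridiagonal M.
M = np.zeros((d, d))
for i in range(d - 1):
    M[i, i] += 1.0
    M[i + 1, i + 1] += 1.0
    M[i, i + 1] -= 1.0
    M[i + 1, i] -= 1.0
M[d - 1, d - 1] += 1.0
c = mu * (kappa - 1)

def grad_f(x):
    return 2.0 * c * (M @ x)

def prox_psi(x):
    # psi is the indicator of {x : x^(1) = 0}; its prox (for any scaling)
    # is the projection that zeroes the first coordinate.
    y = x.copy()
    y[0] = 0.0
    return y

x = np.zeros(d)
x[0] = 1.0            # initial point x0 = [1, 0, ..., 0]
h = np.zeros(d)       # control variate h0 = 0

for _ in range(T):
    x_hat = x - gamma * (grad_f(x) - h)            # gradient step shifted by h
    if rng.random() < p:                           # take the prox with prob. p
        x_next = prox_psi(x_hat - (gamma / p) * h)
    else:                                          # otherwise skip the prox
        x_next = x_hat
    h = h + (p / gamma) * (x_next - x_hat)         # control-variate update
    x = x_next

print(np.linalg.norm(x))   # distance to the global optimum x* = 0
```

Replacing `grad_f(x)` with `grad_f(x) + 0.1 * rng.standard_normal(d)` in the loop gives the SProxSkip variant described in the quoted setup.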