A Precise Characterization of SGD Stability Using Loss Surface Geometry
Authors: Gregory Dexter, Borja Ocejo, Sathiya Keerthi, Aman Gupta, Ayan Acharya, Rajiv Khanna
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 5 (Experiments): "In this section, we support our prior theorems by empirically evaluating the behavior of SGD on synthetic optimization problems with additively decomposable loss functions." |
| Researcher Affiliation | Collaboration | Gregory Dexter¹, Borja Ocejo², Sathiya Keerthi², Aman Gupta², Ayan Acharya² & Rajiv Khanna¹ (¹Purdue University, ²LinkedIn Corporation) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | To ensure reproducibility, we include all our implementations in the supplementary material. |
| Open Datasets | No | The experiments use synthetic optimization problems based on the construction from the proof of Theorem 2, i.e., generated data; the paper provides no concrete access information and does not refer to any well-known public dataset. |
| Dataset Splits | No | The experiments run SGD on synthetic data to verify theoretical predictions about divergence, so there are no traditional training/validation/test splits; the focus is SGD behavior under specific conditions rather than model generalization on empirical datasets. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or solvers. |
| Experiment Setup | Yes | In this construction, we set $H_i = m\,e_1 e_1^T$ for all $i \in [\sigma]$ and $H_i = m\,e_{i-\sigma+1} e_{i-\sigma+1}^T$ otherwise, with $m = 2n/\sigma$. We set the dimension of the space to $n - \sigma + 1$... Across all experiments, we set $n = 100$. For each set of parameters $(B, \eta, \sigma)$, we determine whether the combination leads to divergence or not by executing SGD for a maximum of 1000 steps. Specifically, we classify a tuple as divergent if, in the majority of the five repetitions, the norm of the parameter vector $w$ increases by a factor of 1000. (A minimal simulation sketch of this protocol follows the table.) |
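
As a rough illustration of the divergence test quoted in the Experiment Setup row, here is a minimal NumPy sketch under stated assumptions. The rank-one Hessian construction and the divergence criterion (at most 1000 SGD steps, five repetitions, majority vote on a 1000x norm blow-up) follow the quoted text; the initialization scale, batch sampling without replacement, and the function names (`build_hessians`, `sgd_diverges`, `classify_divergent`) are our assumptions, not the authors' implementation, which the paper says is included in its supplementary material.

```python
import numpy as np

def build_hessians(n=100, sigma=10):
    """Construct the rank-one component Hessians from the quoted
    Theorem 2 construction: H_i = m * e_1 e_1^T for i in [sigma],
    H_i = m * e_{i-sigma+1} e_{i-sigma+1}^T otherwise, with
    m = 2n / sigma and ambient dimension d = n - sigma + 1."""
    d = n - sigma + 1
    m = 2 * n / sigma
    # Each H_i is m * e_k e_k^T, so it suffices to store the basis index k.
    idx = np.array([0] * sigma + list(range(1, d)))  # 0-based indices
    return idx, m, d

def sgd_diverges(eta, B, n=100, sigma=10, max_steps=1000,
                 blowup=1000.0, seed=0):
    """Run SGD on the quadratic loss L(w) = (1/n) sum_i (1/2) w^T H_i w
    and report divergence if ||w|| grows by the given factor within
    max_steps. The Gaussian initialization is an assumption; the excerpt
    above does not specify it."""
    rng = np.random.default_rng(seed)
    idx, m, d = build_hessians(n, sigma)
    w = rng.standard_normal(d)
    w0_norm = np.linalg.norm(w)
    for _ in range(max_steps):
        # Assumed: mini-batch sampled uniformly without replacement.
        batch = rng.choice(n, size=B, replace=False)
        # Mini-batch gradient: mean over i in the batch of H_i w,
        # where H_i w = m * w[k] * e_k for basis index k = idx[i].
        grad = np.zeros(d)
        for i in batch:
            k = idx[i]
            grad[k] += m * w[k]
        grad /= B
        w -= eta * grad
        if np.linalg.norm(w) > blowup * w0_norm:
            return True
    return False

def classify_divergent(eta, B, sigma, reps=5):
    """Majority vote over independent repetitions, per the quoted setup."""
    votes = sum(sgd_diverges(eta, B, sigma=sigma, seed=r)
                for r in range(reps))
    return votes > reps // 2
```

A call such as `classify_divergent(eta=0.5, B=8, sigma=10)` would evaluate one cell of a $(B, \eta, \sigma)$ sweep; the actual parameter grids used in the paper are not specified in the excerpt above.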