Parameter Symmetry and Noise Equilibrium of Stochastic Gradient Descent

Authors: Liu Ziyin, Mingze Wang, Hongchao Li, Lei Wu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our theory to understand specific problems and present numerical results in Section 5. All the proofs are presented in the Appendix.
Researcher Affiliation | Collaboration | Liu Ziyin (Massachusetts Institute of Technology, NTT Research, ziyinl@mit.edu); Mingze Wang (Peking University, mingzewang@stu.pku.edu.cn); Hongchao Li (The University of Tokyo, lhc@cat.phys.s.u-tokyo.ac.jp); Lei Wu (Peking University, leiwu@math.pku.edu.cn)
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | No | Checklist question: "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" Answer: [No]. Justification: "The code or data of the experiments are simple and easy to reproduce following the description in the main text."
Open Datasets | Yes | Here, we give the details for the experiment in Figure 2. We train a two-layer linear net with d0 = d2 = 30 and d = 40. The input data is x ∼ N(0, 1), and y = x + ε, where ε is i.i.d. Gaussian with unit variance.
Dataset Splits | No | The paper mentions training and testing phases but does not explicitly provide details about training/validation/test dataset splits, such as percentages or sample counts for a validation set.
Hardware Specification | No | Checklist question: "For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?" Answer: [No]. Justification: "The experiments can be simply conducted on personal computers."
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, or library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | "Here, we give the details for the experiment in Figure 2. We train a two-layer linear net with d0 = d2 = 30 and d = 40. The input data is x ∼ N(0, 1), and y = x + ε, where ε is i.i.d. Gaussian with unit variance." (Section A.2). "When the learning rate (η = 0.008) is too large, SGD diverges (orange line). However, when one starts training at a small learning rate (0.001) and increases η to 0.008 after 5000 iterations, the training remains stable." (Figure 4 caption). "Unless it is the independent variable, η, S and d are set to be 0.1, 100 and 2000, respectively." (Figure 8 caption). A hedged training sketch based on these quoted details appears below the table.
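
For illustration only, the following is a minimal PyTorch sketch of the Figure 2 setup as quoted above: a two-layer linear network with d0 = d2 = 30 and d = 40, synthetic data x ∼ N(0, 1) with targets y = x + ε (unit-variance Gaussian noise), and a learning-rate warmup from 0.001 to 0.008 after 5000 iterations. The mean-squared-error loss, batch size, and total step count are assumptions not stated in this excerpt; this is not the authors' implementation.

```python
# Hypothetical sketch of the quoted setup, not the authors' code.
import torch

d0, d, d2 = 30, 40, 30                 # input, hidden, and output widths (from the quoted setup)
batch_size, total_steps = 100, 20000   # assumed values, not stated in the excerpt

# Two-layer *linear* network: no activation between the layers.
model = torch.nn.Sequential(
    torch.nn.Linear(d0, d, bias=False),
    torch.nn.Linear(d, d2, bias=False),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # small warmup learning rate

for step in range(total_steps):
    # Synthetic data: x ~ N(0, I), y = x + eps with unit-variance Gaussian noise.
    x = torch.randn(batch_size, d0)
    y = x + torch.randn(batch_size, d0)

    loss = torch.nn.functional.mse_loss(model(x), y)  # assumed objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Warmup: raise the learning rate to 0.008 after 5000 iterations
    # (per the quoted Figure 4 caption, starting directly at 0.008 diverges).
    if step == 5000:
        for group in optimizer.param_groups:
            group["lr"] = 0.008
```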