Fast convergence of stochastic subgradient method under interpolation
Authors: Huang Fang, Zhenan Fan, Michael Friedlander
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a formal analysis showing that SSGD for nonsmooth objectives could converge as fast as smooth objectives in the interpolation setting. Our contributions include: ... Proof that the iteration bound O(1/ϵ) is optimal... We now present some numerical experiments to compare the convergence of SSGD for training ReLU neural networks with smooth and nonsmooth loss functions. |
| Researcher Affiliation | Academia | Huang Fang, Zhenan Fan & Michael P. Friedlander, Department of Computer Science, University of British Columbia, Vancouver, BC, Canada |
| Pseudocode | Yes | Algorithm 1 Stochastic subgradient descent. The learning rate function αt : N → R+ returns the learning rate at iteration t. 1: Initialize: w(1) ∈ Rd 2: for t = 1, 2, . . . do 3: select i ∈ {1, 2, . . . , n} uniformly at random 4: compute g(t) ∈ ∂fi(w(t)) 5: w(t+1) = w(t) − αt g(t) (see the runnable sketch below this table) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology. |
| Open Datasets | Yes | We train the LeNet (Lecun et al., 1998) on the MNIST dataset to classify 4s and 9s. |
| Dataset Splits | No | The paper mentions using training data and the MNIST dataset but does not specify any explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions neural network architectures and loss functions, but does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | We randomly generate a small one hidden layer neural network with 16 neurons and ReLU activation as the teacher network... We overparameterize the student neural network and set it to be a one hidden layer network with 512 neurons and ReLU activation. We train the student network with different loss functions: squared loss, i.e., (1/n) ∑_{i=1}^n (y_i − ŷ_i)^2, and absolute loss, i.e., (1/n) ∑_{i=1}^n \|y_i − ŷ_i\|, with different learning rates. We train the LeNet (Lecun et al., 1998) on the MNIST dataset... Then we run SSGD to train the model with different loss functions: logistic loss, i.e., (1/n) ∑_{i=1}^n log(1 + exp(−y_i ŷ_i)), and L1-hinge loss, i.e., (1/n) ∑_{i=1}^n max{0, 1 − y_i ŷ_i}, with different learning rates. (A hypothetical setup sketch follows the table.) |
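
To make the quoted Algorithm 1 concrete, here is a minimal NumPy sketch of stochastic subgradient descent. The names (`ssgd`, `subgrad_fns`, `lr_schedule`) and the usage example are illustrative assumptions, not from the paper; the per-example subgradient oracles are assumed to be supplied by the caller.

```python
import numpy as np

def ssgd(subgrad_fns, w0, lr_schedule, num_iters, seed=0):
    """Stochastic subgradient descent (informal sketch of Algorithm 1).

    subgrad_fns : list of callables; subgrad_fns[i](w) returns some
                  subgradient g in the subdifferential of f_i at w.
    w0          : initial iterate w^(1) in R^d.
    lr_schedule : callable t -> alpha_t, the learning rate at iteration t.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = len(subgrad_fns)
    for t in range(1, num_iters + 1):
        i = rng.integers(n)            # select i in {1, ..., n} uniformly at random
        g = subgrad_fns[i](w)          # g^(t) in the subdifferential of f_i at w^(t)
        w = w - lr_schedule(t) * g     # w^(t+1) = w^(t) - alpha_t * g^(t)
    return w

# Usage example (hypothetical data): minimize the nonsmooth average absolute loss
# (1/n) * sum_i |x_i . w - y_i| with a constant learning rate.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
subgrads = [lambda w, x=X[i], yi=y[i]: np.sign(x @ w - yi) * x for i in range(100)]
w_hat = ssgd(subgrads, w0=np.zeros(5), lr_schedule=lambda t: 0.01, num_iters=5000)
```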
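
The paper does not state which framework was used; the following PyTorch sketch only illustrates the quoted teacher-student setup (16-neuron ReLU teacher, 512-neuron ReLU student, squared vs. absolute loss), with assumed values for the input dimension, sample count, learning rate, and epoch count.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 10, 1000  # input dimension and sample count (assumed, not from the paper)

# Teacher: one hidden layer with 16 ReLU neurons; its outputs define the labels.
teacher = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
X = torch.randn(n, d)
with torch.no_grad():
    y = teacher(X)

def make_student():
    # Overparameterized student: one hidden layer with 512 ReLU neurons.
    return nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 1))

loss_fns = {
    "squared (smooth)":     nn.MSELoss(),  # (1/n) sum (y_i - yhat_i)^2
    "absolute (nonsmooth)": nn.L1Loss(),   # (1/n) sum |y_i - yhat_i|
}

def train(model, loss_fn, lr=0.01, epochs=20):
    """Single-sample SGD; with the L1 loss each step is a stochastic subgradient step."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for idx in torch.randperm(n).tolist():
            opt.zero_grad()
            loss = loss_fn(model(X[idx:idx + 1]), y[idx:idx + 1])
            loss.backward()
            opt.step()
    with torch.no_grad():
        return loss_fn(model(X), y).item()

for name, loss_fn in loss_fns.items():
    print(name, train(make_student(), loss_fn))
```

The MNIST/LeNet experiment quoted above follows the same pattern, swapping in the image data, the LeNet model, and the logistic or L1-hinge loss.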