Toward Better Generalization Bounds with Locally Elastic Stability

Authors: Zhun Deng, Hangfeng He, Weijie Su

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To further motivate our study, note that there are many cases where the worst-case sensitivity of the loss is much larger than the average sensitivity, especially in random feature models or neural networks. As a concrete example, from Figure 1, we can observe that the sensitivity of neural networks and random feature models depends highly on the label information. To be precise, consider training two models on the CIFAR-10 dataset (Krizhevsky, 2009) and another dataset obtained by removing one training example, say an image of a plane, from CIFAR-10, respectively. Figure 1 shows that the difference between the loss function values for the two models depends on the label of the test image that the loss function is evaluated at. (A sketch of this leave-one-out measurement appears after the table.)
Researcher Affiliation | Academia | 1. Harvard University; 2. Department of Computer and Information Science, University of Pennsylvania; 3. Wharton Statistics Department, University of Pennsylvania.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. The paper presents mathematical definitions and derivations.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | Figure 1 shows that the difference between the loss function values for the two models depends on the label of the test image that the loss function is evaluated at: the difference between the loss function values, or sensitivity for short, is significant if the test image is another plane, and the sensitivity is small if the test image is from a different class, such as car or cat. Concretely, the average plane-to-plane difference is about seven times the average plane-to-cat difference. The dependence on whether the two images belong to the same class results in a pronounced diagonal structure in Figure 1(a), which is consistent with the phenomenon of local elasticity in deep learning training (He & Su, 2020; Chen et al., 2020). In particular, this structural property of the loss function differences clearly demonstrates that uniform stability fails to capture how sensitive the loss function is in the population sense, which is considerably smaller than the worst-case sensitivity, for the neural networks and random feature models. We further demonstrate the step-wise characterization of class-level sensitivity for neural networks (based on a pre-trained ResNet-18) and random feature models (based on a randomly initialized ResNet-18) trained for different numbers of epochs by SGD on CIFAR-10. (The class-level aggregation is sketched after the table.)
Dataset Splits | No | The paper mentions training and test data for CIFAR-10 but does not specify any validation dataset splits.
Hardware Specification | No | The paper mentions training models such as ResNet-18 but does not specify the hardware used to run the experiments.
Software Dependencies | No | The paper mentions models such as ResNet-18 trained by SGD but does not name specific software libraries or versions.
Experiment Setup | Yes | To be precise, consider training two models on the CIFAR-10 dataset (Krizhevsky, 2009) and another dataset obtained by removing one training example, say an image of a plane, from CIFAR-10, respectively. [...] We further demonstrate the step-wise characterization of class-level sensitivity for neural networks (based on a pre-trained ResNet-18) and random feature models (based on a randomly initialized ResNet-18) trained for different numbers of epochs by SGD on CIFAR-10. Suppose that we run SGD with step sizes η_t ≤ 2/α for T steps. Suppose that we run SGD for T steps with a monotonically non-increasing learning rate η_t ≤ c/t for some constant c > 0. (Both step-size conditions are encoded in the last sketch below.)
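To make the leave-one-out protocol quoted under Research Type concrete, here is a minimal PyTorch-style sketch. The helper `train_model` and the data handling are hypothetical stand-ins, not the authors' released code; this only illustrates the sensitivity measurement the excerpt describes.

```python
import torch
import torch.nn.functional as F

def loo_sensitivity(train_set, test_set, train_model, removed_idx):
    """Leave-one-out sensitivity: train on the full training set and on
    the same set with one example removed, then record the absolute
    per-test-point loss difference between the two models.

    `train_model` is a hypothetical helper that trains a classifier
    from a fixed initialization and returns it in eval mode.
    """
    full_model = train_model(train_set)
    reduced_set = [z for i, z in enumerate(train_set) if i != removed_idx]
    loo_model = train_model(reduced_set)

    diffs = []
    for x, y in test_set:  # x: (C, H, W) tensor, y: int class label
        with torch.no_grad():
            target = torch.tensor([y])
            l_full = F.cross_entropy(full_model(x.unsqueeze(0)), target)
            l_loo = F.cross_entropy(loo_model(x.unsqueeze(0)), target)
        diffs.append((y, abs(l_full.item() - l_loo.item())))
    return diffs  # list of (test label, |loss difference|) pairs
```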
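The pronounced diagonal structure described under Open Datasets can then be summarized as a class-by-class matrix of average sensitivities. A sketch under the same assumptions, consuming the output of `loo_sensitivity` for several removed examples:

```python
import numpy as np

def class_sensitivity_matrix(diff_lists, removed_labels, n_classes=10):
    """Average |loss difference| grouped by (class of the removed
    training example, class of the test example). A strong diagonal,
    e.g. the plane-to-plane average being ~7x the plane-to-cat average
    as the excerpt reports, indicates label-dependent sensitivity.
    """
    sums = np.zeros((n_classes, n_classes))
    counts = np.zeros((n_classes, n_classes))
    for removed_cls, diffs in zip(removed_labels, diff_lists):
        for test_label, d in diffs:
            sums[removed_cls, test_label] += d
            counts[removed_cls, test_label] += 1
    return sums / np.maximum(counts, 1)  # guard against empty cells
```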
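Finally, the two step-size conditions quoted in the Experiment Setup row (η_t ≤ 2/α, and a monotonically non-increasing η_t ≤ c/t) are straightforward to encode. A minimal sketch; the constants `c` and `alpha` are illustrative placeholders, not values from the paper:

```python
def sgd_step_size(t, c=1.0, alpha=10.0):
    """Learning rate for SGD step t >= 1 satisfying both quoted
    conditions: eta_t <= c/t (non-increasing) and eta_t <= 2/alpha.
    `c` and `alpha` (a smoothness-type constant) are placeholders.
    """
    return min(c / t, 2.0 / alpha)

# The resulting schedule is monotonically non-increasing in t:
etas = [sgd_step_size(t) for t in range(1, 11)]
assert all(a >= b for a, b in zip(etas, etas[1:]))
```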