Some Fundamental Aspects about Lipschitz Continuity of Neural Networks

Authors: Grigory Khromov, Sidak Pal Singh

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz and explain the intriguing effects of label noise on function smoothness and generalisation. (The lower bound is illustrated in the first sketch after the table.)
Researcher Affiliation | Academia | Grigory Khromov (a) and Sidak Pal Singh (a,b); (a) Department of Computer Science, ETH Zürich; (b) Max Planck ETH Center for Learning Systems
Pseudocode | No | The paper describes mathematical derivations and experimental procedures but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available on GitHub.
Open Datasets | Yes | (b) datasets: CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), MNIST, MNIST1D (Greydanus, 2020) (a harder version of usual MNIST), as well as ImageNet;
Dataset Splits | Yes | To compute the lower bound on convex combinations of samples from MNIST1D we constructed a set S, which contains: (a) the training set (4,000 samples), (b) the test set (1,000 samples), (c) convex combinations λx_i + (1 − λ)x_j of training samples (100,000 samples for each λ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}), and (d) convex combinations λx_i + (1 − λ)x_j of test samples (100,000 samples for each λ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}). Altogether this makes S contain 1,005,000 samples. (This construction is sketched after the table.)
Hardware Specification | No | The paper mentions high computational complexity for certain models and datasets (e.g., "evaluating 1.2 million Jacobian matrices of size 1,000 × 150,528"), but it does not specify any particular GPU or CPU models, memory sizes, or specific hardware configurations used for the experiments.
Software Dependencies | No | The paper mentions using the "scipy sparse CSR format and using scipy.sparse.linalg library", a "pytorch implementation", and the "vit-pytorch Python package implementation", but does not provide specific version numbers for any of these software components. (See the spectral-norm sketch after the table for how such a sparse routine is typically used.)
Experiment Setup | Yes | We present experiments that compare models that have a substantially varying number of parameters. To minimise the effect of variability in training, we painstakingly enforce the same learning rate, batch size and optimiser configuration for all models in one sweep. ... For model f_θ with parameter vector θ and loss on the training set L(θ, S), after the end of each epoch we compute ‖∇_θ L(θ, S)‖_2, which we call the gradient norm for simplicity. In all experiments, unless stated otherwise, we control model training by monitoring the respective gradient norm: if it reaches a small value (ideally zero), our model has negligible parameter change (i.e. ‖θ_{t+1} − θ_t‖_2 is small) or, in other words, has reached a local minimum. By means of experimentation, we found that stopping models at a gradient norm value of 0.01 gives good results for most scenarios. ... This section thoroughly describes the learning rate schedulers (LR schedulers for short) that we use in our experiments. (This stopping rule is sketched below.)
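The lower Lipschitz bound highlighted in the Research Type row comes from input-output Jacobians, consistent with the Hardware row's mention of "1.2 million Jacobian matrices". Below is a minimal sketch, not the authors' code: the spectral norm of the Jacobian at any input is a local Lipschitz constant and hence a lower bound on the global one, so taking the maximum over a set of evaluation points yields the reported lower bound. The toy model, the 40-dimensional inputs, and the helper name lower_lipschitz_bound are illustrative assumptions.

```python
# Sketch: lower Lipschitz bound as max_x ||J_f(x)||_2 over evaluation points.
import torch
from torch.autograd.functional import jacobian

def lower_lipschitz_bound(model, inputs):
    """Return the largest Jacobian spectral norm over the given inputs."""
    model.eval()
    best = 0.0
    for x in inputs:  # x is one flattened input vector
        J = jacobian(lambda z: model(z.unsqueeze(0)).squeeze(0), x)
        # J has shape (num_outputs, input_dim); its largest singular value is
        # the local Lipschitz constant at x, a valid global lower bound.
        sigma_max = torch.linalg.matrix_norm(J, ord=2)
        best = max(best, sigma_max.item())
    return best

# Hypothetical usage with a toy MLP on 40-dimensional (MNIST1D-sized) inputs.
model = torch.nn.Sequential(torch.nn.Linear(40, 100), torch.nn.ReLU(),
                            torch.nn.Linear(100, 10))
points = torch.randn(8, 40)
print(lower_lipschitz_bound(model, points))
```

For ImageNet-scale models each Jacobian is 1,000 × 150,528, which is why the paper flags the computational cost of evaluating it over a large point set.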
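The Dataset Splits row describes building the evaluation set S from MNIST1D samples and their convex combinations. Here is a minimal sketch of that construction, with random stand-in arrays in place of the actual MNIST1D splits and a hypothetical helper convex_combinations; only the split sizes and λ values come from the quoted passage.

```python
# Sketch: assemble S = train samples + test samples + convex combinations
# lambda * x_i + (1 - lambda) * x_j for several mixing coefficients lambda.
import numpy as np

rng = np.random.default_rng(0)

def convex_combinations(X, lambdas, n_pairs=100_000):
    """Mix randomly chosen pairs of rows of X for each lambda."""
    out = []
    for lam in lambdas:
        i = rng.integers(0, len(X), size=n_pairs)
        j = rng.integers(0, len(X), size=n_pairs)
        out.append(lam * X[i] + (1.0 - lam) * X[j])
    return np.concatenate(out, axis=0)

# Hypothetical stand-ins for the MNIST1D splits (4,000 train / 1,000 test
# samples, 40 features each).
X_train = rng.standard_normal((4_000, 40))
X_test = rng.standard_normal((1_000, 40))
lambdas = [0.1, 0.2, 0.3, 0.4, 0.5]

S = np.concatenate([X_train,
                    X_test,
                    convex_combinations(X_train, lambdas),
                    convex_combinations(X_test, lambdas)], axis=0)
print(S.shape)  # (1_005_000, 40): 4,000 + 1,000 + 2 * 5 * 100,000
```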
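The Software Dependencies row cites the scipy sparse CSR format and scipy.sparse.linalg without versions. A plausible use, assumed here rather than confirmed by the excerpt, is computing the largest singular value of a layer stored as a sparse matrix, i.e. the per-layer spectral norm that enters product-of-norms upper Lipschitz bounds. The random matrix below is purely illustrative.

```python
# Sketch: spectral norm of a sparse layer matrix via scipy.sparse.linalg.svds.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

dense_weight = np.random.randn(1_000, 4_000)    # hypothetical layer matrix
dense_weight[np.abs(dense_weight) < 2.5] = 0.0  # keep it mostly sparse
W = csr_matrix(dense_weight)

# svds with k=1 returns the single largest singular value, i.e. ||W||_2.
_, s, _ = svds(W, k=1)
print("spectral norm (one factor of the upper bound):", s[0])
```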
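Finally, the Experiment Setup row defines the gradient-norm stopping rule: after each epoch, compute ‖∇_θ L(θ, S)‖_2 over the full training set and stop once it drops below 0.01. A minimal sketch follows, assuming a generic PyTorch model, mean-reducing loss, optimiser, and data loader; all names are hypothetical and this is not the authors' training script.

```python
# Sketch: stop training once the full-batch gradient norm falls below 0.01.
import torch

def full_gradient_norm(model, loss_fn, loader):
    """L2 norm of the training-set loss gradient w.r.t. all parameters."""
    model.zero_grad(set_to_none=True)
    n = 0
    for x, y in loader:
        # loss_fn averages over the batch; rescale so accumulated gradients
        # sum to the dataset-wide gradient, then normalise by n at the end.
        (loss_fn(model(x), y) * len(x)).backward()
        n += len(x)
    squared = sum((p.grad / n).pow(2).sum()
                  for p in model.parameters() if p.grad is not None)
    return squared.sqrt().item()

def train_until_flat(model, loss_fn, optimiser, loader,
                     max_epochs=1000, tol=0.01):
    """Train until the monitored gradient norm drops below `tol`."""
    for _ in range(max_epochs):
        for x, y in loader:
            optimiser.zero_grad()
            loss_fn(model(x), y).backward()
            optimiser.step()
        if full_gradient_norm(model, loss_fn, loader) < tol:
            break  # negligible parameter change: treat as a local minimum
    return model
```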