Network-to-Network Regularization: Enforcing Occam's Razor to Improve Generalization

Authors: Rohan Ghosh, Mehul Motani

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose, in this work, a novel measure of complexity called Kolmogorov Growth (KG), which we use to derive new generalization error bounds that only depend on the final choice of the classification function. Guided by the bounds, we propose a novel way of regularizing neural networks by constraining the network trajectory to remain in the low KG zone during training. Minimizing KG while learning is akin to applying the Occam's razor to neural networks. The proposed approach, called network-to-network regularization, leads to clear improvements in the generalization ability of classifiers. We verify this for three popular image datasets (MNIST, CIFAR-10, CIFAR-100) across varying training data sizes.
Researcher Affiliation | Academia | Rohan Ghosh and Mehul Motani, Department of Electrical and Computer Engineering, N.1 Institute for Health, Institute of Data Science, National University of Singapore; rghosh92@gmail.com, motani@nus.edu.sg
Pseudocode | Yes | Algorithm 1: N2N Regularization (Multi-Level). (A hedged sketch of what such a regularized training step could look like appears after this table.)
Open Source Code | Yes | Code will be made available at https://github.com/rghosh92/N2N.
Open Datasets | Yes | We test N2N on three datasets: MNIST [17], CIFAR-10 [18] and CIFAR-100 [19].
Dataset Splits | Yes | For the CIFAR-10 and CIFAR-100 datasets, we report the accuracies using a 48k-2k training-validation split of the data for both, as we find it to yield best performance (due to hard convergence).
Hardware Specification | Yes | Experiments were either carried out on an RTX 2060 GPU or a Tesla V100 or A100 GPU.
Software Dependencies | No | The paper mentions specific network architectures (e.g., ResNet44, ResNet50, 5-layer CNN) but does not list software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | All networks were trained for a total of 200 iterations, and in each case the reported results are averaged over five networks. For all experiments we set e_base = 3, e_small = 1 in Algorithm 1. The values of the regularization parameters (λ0, λ1) are provided in the supplementary material. (A hedged data-split and configuration sketch appears after this table.)
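
The abstract and pseudocode rows above describe network-to-network (N2N) regularization only at a high level; the exact procedure is Algorithm 1 in the paper and is not reproduced here. The sketch below is a minimal, assumed interpretation: a smaller auxiliary network is fitted to mimic the main network, and the main network is additionally penalized by how poorly it can be mimicked, loosely capturing the "stay in the low KG zone" idea. The function names, the MSE mimicry loss, and the lambda0/lambda1 defaults are illustrative assumptions, not the authors' specification.

```python
# Hedged sketch (NOT the authors' exact Algorithm 1): one plausible reading of
# network-to-network (N2N) regularization. A smaller "probe" network is trained
# to mimic the main network's outputs, and the main network is penalized by how
# poorly the probe matches it. lambda0/lambda1 are hypothetical stand-ins for
# the paper's (λ0, λ1), whose values live in the supplementary material.
import torch
import torch.nn.functional as F

def n2n_training_step(main_net, small_net, opt_main, opt_small,
                      x, y, lambda0=0.1):
    """One training step with an assumed N2N-style regularizer."""
    # 1) Fit the small network to the main network's current outputs (mimicry).
    with torch.no_grad():
        target_logits = main_net(x)
    small_logits = small_net(x)
    mimic_loss = F.mse_loss(small_logits, target_logits)
    opt_small.zero_grad()
    mimic_loss.backward()
    opt_small.step()

    # 2) Update the main network: task loss plus a penalty that keeps its
    #    function close to what the smaller network can represent.
    main_logits = main_net(x)
    task_loss = F.cross_entropy(main_logits, y)
    with torch.no_grad():
        small_logits = small_net(x)
    n2n_penalty = F.mse_loss(main_logits, small_logits)
    loss = task_loss + lambda0 * n2n_penalty  # a second level would add a lambda1 term
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
    return task_loss.item(), n2n_penalty.item()
```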
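
The dataset-split and experiment-setup rows quote concrete numbers: a 48k-2k training-validation split, 200 training iterations, five networks per setting, and e_base = 3, e_small = 1. Since the paper does not pin down its software stack, the snippet below shows how such a split could be built with torchvision; the library choice, transforms, batch size, and seed are assumptions, and the (λ0, λ1) values are left to the paper's supplementary material.

```python
# Hedged sketch: a 48k-2k training-validation split of CIFAR-100, matching the
# numbers quoted in the table. The use of torchvision, the transform, the batch
# size, and the seed are assumptions; the paper does not specify its software stack.
import torch
from torch.utils.data import random_split, DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the paper's preprocessing is unspecified
full_train = datasets.CIFAR100(root="./data", train=True, download=True,
                               transform=transform)

generator = torch.Generator().manual_seed(0)  # assumed seed, for reproducibility only
train_set, val_set = random_split(full_train, [48_000, 2_000], generator=generator)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128, shuffle=False)

# Hyperparameters quoted in the Experiment Setup row; (λ0, λ1) are not
# reproduced here because the paper defers them to the supplementary material.
config = {"training_iterations": 200, "runs_per_setting": 5,
          "e_base": 3, "e_small": 1}
```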