Predicting Deep Neural Network Generalization with Perturbation Response Curves

Authors: Yair Schiff, Brian Quanz, Payel Das, Pin-Yu Chen

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose a new framework for evaluating the generalization capabilities of trained networks. We use perturbation response (PR) curves that capture the accuracy change of a given network as a function of varying levels of training sample perturbation. From these PR curves, we derive novel statistics that capture generalization capability. Specifically, we introduce two new measures for accurately predicting generalization gaps: the Gi-score and Pal-score... Using our framework applied to intra- and inter-class sample mixup, we attain better predictive scores than the current state-of-the-art measures on a majority of tasks in the PGDL competition. (A minimal sketch of a PR curve and a Gini-style score appears after this table.)
Researcher Affiliation | Industry | Yair Schiff¹, Brian Quanz², Payel Das², Pin-Yu Chen²; ¹IBM Watson, ²IBM Research; {yair.schiff@,blquanz@us.,daspa@us.,pin-yu.chen@}ibm.com
Pseudocode | Yes | This methodology is summarized in Algorithm 1 in Appendix A.3. ... We summarize this in Algorithm 2 in Appendix A.4. ... We give the pseudocode for the Pal-score in Algorithm 4 in Appendix A.6.
Open Source Code | Yes | We use the trained networks and their configurations, training data, and starting kit code from the competition; all open-sourced and provided under the Apache 2.0 license¹. The code includes utilities for loading models and model details and running scoring. To this base repository, we added our methods for performing different perturbations at different layers, computing PR curves, and computing our proposed Gi- and Pal-scores. ¹ https://github.com/google-research/google-research/tree/master/pgdl
Open Datasets | Yes | The datasets are comprised of CIFAR-10 [28], SVHN [23], CINIC-10 [29], Oxford Flowers [30], Oxford Pets [31], and Fashion MNIST [32].
Dataset Splits | No | The paper does not explicitly provide training/validation/test split details; it refers only to evaluating on 'a sample of the training data' for generating PR curves and on the test set for measuring generalization.
Hardware Specification | Yes | Each run is performed with 4 CPUs, 4 GB RAM, and 1 V100 GPU and batch size 128, submitted as resource-restricted jobs to a cluster.
Software Dependencies | No | The paper mentions using PyTorch and PyTorch Lightning, and references TensorFlow, but does not provide specific version numbers for these software dependencies in the context of their experiments.
Experiment Setup | Yes | For all models, we train with batch sizes of either 1024, 2048, or 4096 and learning rates of either 1e-4 or 1e-5. All models are trained with Adam optimization and a learning rate scheduler that reduces the learning rate on training loss plateaus. (A minimal PyTorch sketch of this training setup appears after this table.)
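
To make the PR-curve framework quoted in the Research Type row concrete, the sketch below applies mixup perturbation to a sample of training data at increasing interpolation levels, records accuracy at each level, and summarizes the resulting curve with a Gini-style area statistic. It is a minimal illustration under stated assumptions, not the authors' released implementation: the mixup partner selection, the `alphas` grid, and the normalization inside `gi_score` are placeholders, and the exact Gi- and Pal-score definitions are the ones given in the paper's appendix algorithms.

```python
import numpy as np
import torch


def pr_curve(model, x, y, alphas, batch_size=128, device="cpu"):
    """Accuracy of `model` on mixup-perturbed copies of (x, y), one value per
    perturbation level in `alphas` (illustrative sketch)."""
    model = model.to(device).eval()
    accuracies = []
    with torch.no_grad():
        for alpha in alphas:
            correct, total = 0, 0
            for i in range(0, len(x), batch_size):
                xb = x[i:i + batch_size].to(device)
                yb = y[i:i + batch_size].to(device)
                # Mix each sample with a randomly chosen partner:
                # x' = (1 - alpha) * x + alpha * x_partner.
                # (Intra-class mixup would restrict partners to the same class.)
                partner = torch.randperm(xb.size(0), device=device)
                xb_mixed = (1.0 - alpha) * xb + alpha * xb[partner]
                predictions = model(xb_mixed).argmax(dim=1)
                correct += (predictions == yb).sum().item()
                total += yb.size(0)
            accuracies.append(correct / total)
    return np.array(accuracies)


def gi_score(accuracies):
    """Gini-style inequality statistic over the accuracy drops along the PR
    curve (hypothetical formulation; assumes accuracies[0] is the unperturbed
    accuracy at alpha = 0)."""
    drops = accuracies[0] - accuracies            # accuracy lost per level
    shares = np.cumsum(np.sort(drops)) / max(drops.sum(), 1e-12)
    lorenz = np.concatenate([[0.0], shares])      # Lorenz-like cumulative curve
    equality = np.linspace(0.0, 1.0, len(lorenz))
    # Twice the area between the line of equality and the Lorenz-like curve.
    return 2.0 * np.trapz(equality - lorenz, equality)
```

Calling `gi_score(pr_curve(model, x_sample, y_sample, alphas=np.linspace(0.0, 0.5, 11)))` yields one scalar per trained network, which can then be compared against each network's measured generalization gap across the PGDL tasks.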
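
For the Experiment Setup row, the quoted configuration (Adam, learning rates of 1e-4 or 1e-5, batch sizes of 1024 to 4096, and a scheduler that reduces the learning rate on training-loss plateaus) could be expressed in PyTorch roughly as follows. The toy model, random data, and epoch count are placeholders, not the PGDL task networks.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the sketch runs end to end; the real networks and data
# come from the PGDL competition tasks.
x = torch.randn(4096, 3, 32, 32)
y = torch.randint(0, 10, (4096,))
train_loader = DataLoader(TensorDataset(x, y), batch_size=1024, shuffle=True)  # 1024 / 2048 / 4096

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                      # 1e-4 or 1e-5
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for epoch in range(3):                      # epoch count is illustrative only
    epoch_loss = 0.0
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(xb), yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)              # reduce LR when training loss plateaus
```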