Optimizing Mode Connectivity via Neuron Alignment

Authors: Norman Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, Rongjie Lai

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify that the permutation given by alignment is locally optimal via a proximal alternating minimization scheme. Empirically, optimizing the weight permutation is critical for efficiently learning a simple, planar, low-loss curve between networks that successfully generalizes. Our alignment method can significantly alleviate the recently identified robust loss barrier on the path connecting two adversarially robust models and find more robust and accurate models on the path. Code is available at https://github.com/IBM/NeuronAlignment.
Researcher Affiliation | Collaboration | N. Joseph Tatro (Dept. of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY; tatron@rpi.edu); Pin-Yu Chen (IBM Research, Yorktown Heights, NY; pin-yu.chen@ibm.com); Payel Das (IBM Research, Yorktown Heights, NY; daspa@us.ibm.com); Igor Melnyk (IBM Research, Yorktown Heights, NY; igor.melnyk@ibm.com); Prasanna Sattigeri (IBM Research, Yorktown Heights, NY; psattig@us.ibm.com); Rongjie Lai (Dept. of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY; lair@rpi.edu)
Pseudocode | Yes | Algorithm 1: Permutation via Neuron Alignment (a hedged sketch of the alignment step follows the table).
Open Source Code | Yes | Code is available at https://github.com/IBM/NeuronAlignment.
Open Datasets | Yes | We trained neural networks to classify images from CIFAR10 and CIFAR100 (Krizhevsky et al., 2009), as well as Tiny ImageNet (Deng et al., 2009).
Dataset Splits | Yes | The default training and test set splits are used for each dataset. 20% of the images in the training set are used for computing alignments between pairs of models (a data-setup sketch follows the table).
Hardware Specification | Yes | Models were trained on NVIDIA 2080 Ti GPUs.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We set a learning rate of 1E-1 that decays by a factor of 0.5 every 20 epochs. Weight decay of 5E-4 was used for regularization. Each model was trained for 250 epochs, and all models were seen to converge. Curves are trained for 250 epochs using SGD with a learning rate of 1E-2 and a batch size of 128. The rate anneals by a factor of 0.5 every 20 epochs (an optimizer and scheduler sketch follows the table).
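For the Pseudocode row: Algorithm 1 permutes the hidden units of one trained network so that they line up with those of another before a connecting curve is learned. The snippet below is a minimal sketch of that idea, assuming the usual recipe of matching units by the cross-correlation of their activations on held-out data and solving the resulting assignment problem; the function names, the normalization details, and the use of SciPy's linear_sum_assignment are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of per-layer neuron alignment (illustrative, not the released code).
# Idea: match hidden units of model B to model A by maximizing the cross-correlation
# of their activations on the alignment subset, then permute B's weights accordingly.
import numpy as np
from scipy.optimize import linear_sum_assignment


def correlation_matrix(acts_a, acts_b, eps=1e-8):
    """Cross-correlation between units of two layers.

    acts_a, acts_b: (num_samples, num_units) activations collected on the
    alignment data (e.g. the 20% of training images set aside for alignment).
    """
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + eps)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + eps)
    return a.T @ b / acts_a.shape[0]          # shape: (units_a, units_b)


def match_units(acts_a, acts_b):
    """Assignment problem: unit perm[i] of model B is matched to unit i of model A."""
    corr = correlation_matrix(acts_a, acts_b)
    _, perm = linear_sum_assignment(corr, maximize=True)
    return perm


def permute_layer(w_in, b_in, w_out, perm):
    """Apply a hidden-unit permutation to one fully connected layer of model B.

    w_in:  (units, fan_in) weights that produce the layer's activations
    b_in:  (units,) bias of that layer
    w_out: (fan_out, units) weights of the next layer, which consume them
    """
    return w_in[perm], b_in[perm], w_out[:, perm]
```

Convolutional layers are handled analogously by permuting output channels; because the next layer's input dimension is permuted to match, the aligned network computes exactly the same function as before alignment.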
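For the Open Datasets and Dataset Splits rows, one way to set up the data with standard torchvision loaders is sketched below: the default train/test splits are kept, and 20% of the training images are reserved for computing alignments. The transform, batch sizes, and split seed here are placeholders rather than values reported in the paper; CIFAR100 and Tiny ImageNet would be set up analogously.

```python
# Sketch of the data setup with standard torchvision loaders (values not from the paper
# are placeholders): default train/test splits, 20% of training images for alignment.
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the paper's augmentation is not quoted here
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10("./data", train=False, download=True, transform=transform)

n_align = int(0.2 * len(train_set))            # 20% of the training set for alignment
align_set, _ = random_split(
    train_set, [n_align, len(train_set) - n_align],
    generator=torch.Generator().manual_seed(0),  # seed is an assumption
)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
align_loader = DataLoader(align_set, batch_size=128, shuffle=False)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)
```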
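The Experiment Setup row maps directly onto standard PyTorch optimizer and scheduler components, as sketched below. The learning rates, the 0.5 decay every 20 epochs, the 5E-4 weight decay, the 250 epochs, and the batch size of 128 come from the row above; the momentum value and the tiny placeholder modules are assumptions.

```python
# Sketch of the quoted optimization settings with standard PyTorch components.
import torch
import torch.nn as nn

# Placeholders standing in for the paper's CNN endpoints and the curve model.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
curve_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Endpoint training: SGD, lr 1e-1 halved every 20 epochs, weight decay 5e-4.
opt = torch.optim.SGD(model.parameters(), lr=1e-1,
                      momentum=0.9,            # momentum is assumed, not quoted
                      weight_decay=5e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)

# Curve training: SGD, lr 1e-2, batch size 128, same annealing schedule.
curve_opt = torch.optim.SGD(curve_model.parameters(), lr=1e-2,
                            momentum=0.9)      # regularization for curves not quoted
curve_sched = torch.optim.lr_scheduler.StepLR(curve_opt, step_size=20, gamma=0.5)


def run(optimizer, scheduler, epochs=250):
    """Skeleton of one 250-epoch training run; the per-batch update loop is omitted."""
    for _ in range(epochs):
        # ... one epoch of SGD updates over the training loader (batch size 128) ...
        scheduler.step()
```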