Learning Bregman Divergences with Application to Robustness

Authors: Mohamed-Hicham Leghettas, Markus Püschel

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a novel and general method to learn Bregman divergences from raw high-dimensional data that measure similarity between images in pixel space. As a prototypical application, we learn divergences that consider real-world corruptions of images (e.g., blur) as close to the original and noisy perturbations as far, even if in Lp-distance the opposite holds. We also show that the learned Bregman divergence excels on datasets of human perceptual similarity judgment, suggesting its utility in a range of applications. We then define adversarial attacks by replacing projected gradient descent (PGD) with the mirror descent associated with the learned Bregman divergence, and use them to improve the state of the art in robustness through adversarial training for common image corruptions. In particular, for the contrast corruption that was found problematic in prior work, we achieve an accuracy that exceeds the Lp- and the LPIPS-based adversarially trained neural networks by a margin of 27.16% on the CIFAR-10-C corruption data set. (A hedged sketch of the Bregman divergence and the mirror-descent attack step appears after this table.)
Researcher Affiliation | Academia | Mohamed-Hicham Leghettas, Department of Computer Science, ETH Zurich, Switzerland (mleghettas@inf.ethz.ch); Markus Püschel, Department of Computer Science, ETH Zurich, Switzerland (pueschel@inf.ethz.ch)
Pseudocode | Yes | In this section, we provide the pseudo-code of the two major phases of our method. First, Alg. 1 covers the training of the BD discussed in Sec. 3.2 and its inverse map presented in Sec. 4. Our instantiation of the mirror descent procedure used for adversarial training (see Sec. 4) is detailed in Alg. 2. In practice, all these training procedures are performed on batches of images, but here we present them for one image. We also omit the validation loops and early stopping conditions to improve readability. Algorithm 1: Self-supervised BD training. (An assumed sketch of such a training step appears after this table.)
Open Source Code | No | The source code is not part of the supplemental material because it will be released upon publication under a non-anonymized GPLv2 license.
Open Datasets | Yes | We perform experiments on CIFAR-10 [43] and consider the 14 noise-free corruptions from CIFAR-10-C [33] that can be applied with severities from 1 to 5... Specifically, we use the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) data set [81]...
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or counts. While it mentions using a 'test set', it does not detail a validation split or how it was separated from the training data. Appendix B states 'We also omit the validation loops and early stopping conditions to improve readability' in the pseudocode, implying that validation data exists but without providing specific split details.
Hardware Specification | Yes | In practice, our method requires about twice the runtime of standard AT when implemented in PyTorch and run on a single V100 GPU... Scaling up and training on larger data sets with larger image sizes should be straightforward with more GPUs, instead of the one V100 GPU we had access to.
Software Dependencies | No | The paper mentions software such as 'PyTorch', the 'Adam optimizer [40]', and the 'AdamW optimizer [49]' but does not provide specific version numbers for these dependencies, which are necessary for a reproducible description.
Experiment Setup | Yes | For both the base function ϕ and its conjugate ϕ* we use the same architecture: an ICNN with 12 convolutional layers followed by 4 fully connected layers. The strong-convexity parameter is chosen as α = 10^-3. This identity training is performed for 7,000 steps using the Adam optimizer [40] with a batch size of 64, a learning rate of 3 × 10^-4, and no weight decay... The training batch contains 32 clean images, one corrupted image for each clean image, and m = 63 samples of noisy images per clean image (2,080 images in total). The training is performed for 10 epochs using the AdamW optimizer [49] with an initial learning rate of 10^-4 and a weight decay of 10^-9... For the classification model f, we use the PreActResNet-18 architecture [32]... AT is performed using the SGD optimizer for 150 epochs with a learning rate of 0.1 that decays by a factor of 10 every 50 epochs, a batch size of 128, and a weight decay of 5 × 10^-4. These are the same hyperparameters for which RLAT performs best. The RLAT radius is taken to be 0.08. (Hedged sketches of the ICNN base function and the adversarial-training loop appear after this table.)
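
The sketches below are illustrative only; they follow the quotes in this table, not the authors' released code. First, a toy input convex neural network (ICNN) for the base function ϕ. The paper's ICNN has 12 convolutional and 4 fully connected layers; the two-layer fully connected version here (class and attribute names are hypothetical) only shows the two ingredients quoted in the experiment-setup row: non-negative weights on the hidden-to-hidden path so that ϕ is convex in its input, and an added (α/2)·‖x‖² term with α = 10^-3 for strong convexity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyICNN(nn.Module):
    """Toy input convex neural network: a scalar, alpha-strongly convex potential phi(x)."""

    def __init__(self, in_dim, hidden=128, alpha=1e-3):
        super().__init__()
        self.alpha = alpha
        self.Wx0 = nn.Linear(in_dim, hidden)               # first layer acts on x directly
        self.Wz1 = nn.Linear(hidden, hidden, bias=False)   # hidden-to-hidden path: kept non-negative
        self.Wx1 = nn.Linear(in_dim, hidden)               # skip connection from the input
        self.Wz2 = nn.Linear(hidden, 1, bias=False)        # output path: kept non-negative
        self.Wx2 = nn.Linear(in_dim, 1)

    def forward(self, x):
        x = x.flatten(1)
        # Softplus is convex and non-decreasing, so each layer preserves convexity in x.
        z = F.softplus(self.Wx0(x))
        z = F.softplus(F.linear(z, self.Wz1.weight.clamp(min=0)) + self.Wx1(x))
        out = F.linear(z, self.Wz2.weight.clamp(min=0)) + self.Wx2(x)
        # Adding (alpha/2) * ||x||^2 makes phi alpha-strongly convex (alpha = 1e-3 in the paper).
        return out + 0.5 * self.alpha * (x * x).sum(dim=1, keepdim=True)
```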
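From a convex ϕ, the Bregman divergence D_ϕ(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), x − y⟩ follows by definition, and the mirror-descent attack quoted in the Research Type row replaces the PGD update by a gradient step taken in the dual space ∇ϕ(x). The sketch below is a generic rendering of these two formulas under that standard convention, not the paper's exact Alg. 2; `phi` and `phi_conj` stand for the two ICNNs from the experiment-setup row (any callables returning a per-sample scalar work), and the step size and the [0, 1] pixel clamp are assumptions.

```python
import torch


def bregman_divergence(phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, computed per sample.
    create_graph=True keeps the result differentiable w.r.t. phi's parameters."""
    y = y.detach().requires_grad_(True)
    phi_y = phi(y)                                                    # shape (B, 1)
    (grad_phi_y,) = torch.autograd.grad(phi_y.sum(), y, create_graph=True)
    inner = ((x - y) * grad_phi_y).flatten(1).sum(1, keepdim=True)
    return phi(x) - phi_y - inner


def mirror_descent_step(phi, phi_conj, x_adv, grad_loss, step_size):
    """One mirror-descent step replacing a PGD step: map x to the dual space with the
    mirror map grad phi, ascend on the classification loss there, and map back with
    grad phi* (the gradient of the learned conjugate)."""
    x = x_adv.detach().requires_grad_(True)
    (grad_phi_x,) = torch.autograd.grad(phi(x).sum(), x)              # mirror map: grad phi(x)
    dual = (grad_phi_x + step_size * grad_loss).detach().requires_grad_(True)
    (x_new,) = torch.autograd.grad(phi_conj(dual).sum(), dual)        # inverse map: grad phi*(dual)
    return x_new.clamp(0.0, 1.0).detach()                             # assumed pixel range [0, 1]
```

A full attack would iterate `mirror_descent_step`, recomputing `grad_loss` (the gradient of the classifier's loss at the current adversarial image) before each call.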
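The excerpts here do not spell out the objective of Alg. 1 (self-supervised BD training). The step below is only one plausible reading of the stated goal that corrupted images should be close to the original and noisy perturbations far, phrased as a margin/ranking loss over the batch layout from the experiment-setup row (clean images, one corruption each, m noise samples each). The function name and the margin value are hypothetical; it reuses `bregman_divergence` from the previous sketch.

```python
import torch


def bd_training_step(phi, optimizer, x_clean, x_corrupt, x_noisy, margin=1.0):
    """One assumed training step: push D_phi(clean, corrupted) below
    D_phi(clean, noisy) by at least `margin` (hinge/ranking loss).
    Shapes: x_clean, x_corrupt are (B, C, H, W); x_noisy is (B, m, C, H, W)."""
    B, m = x_noisy.shape[0], x_noisy.shape[1]
    d_corrupt = bregman_divergence(phi, x_clean, x_corrupt)                        # (B, 1)
    x_rep = x_clean.unsqueeze(1).expand_as(x_noisy).reshape(B * m, *x_clean.shape[1:])
    d_noisy = bregman_divergence(phi, x_rep, x_noisy.reshape(B * m, *x_clean.shape[1:]))
    d_noisy = d_noisy.view(B, m)
    loss = torch.relu(margin + d_corrupt - d_noisy).mean()                         # hinge over all pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```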
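Finally, the adversarial-training schedule quoted in the experiment-setup row maps directly onto a standard PyTorch loop. The sketch below encodes only what is quoted (SGD, learning rate 0.1 decayed by 10× every 50 epochs, 150 epochs, weight decay 5 × 10^-4, batch size 128 in the loader); momentum 0.9 and the `attack_fn` argument, which stands in for the paper's mirror-descent attack, are assumptions.

```python
import torch
import torch.nn.functional as F


def adversarial_training(model, train_loader, attack_fn, device="cuda"):
    """AT loop with the quoted schedule; `attack_fn(model, x, y)` returns adversarial inputs
    (here, the BD-based mirror-descent attack would replace the usual PGD inner loop)."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)   # momentum is an assumption
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    for epoch in range(150):
        for x, y in train_loader:                                  # batch size 128 in the paper's setup
            x, y = x.to(device), y.to(device)
            x_adv = attack_fn(model, x, y)
            loss = F.cross_entropy(model(x_adv), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```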