The Effectiveness of Random Forgetting for Robust Generalization

Authors: Vijaya Raghavan T Ramkumar, Bahram Zonooz, Elahe Arani

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on benchmark datasets and adversarial attacks show that FOMO alleviates robust overfitting by significantly reducing the gap between the best and last robust test accuracy while improving the state-of-the-art robustness. Furthermore, FOMO provides a better trade-off between standard and robust accuracy, outperforming baseline adversarial methods.
Researcher Affiliation | Collaboration | (1) Eindhoven University of Technology, (2) TomTom, (3) Wayve
Pseudocode | Yes | Algorithm 1: Adversarial Training with FOMO (a hedged sketch of this loop follows the table)
Open Source Code | Yes | Code is available at https://github.com/NeurAI-Lab/FOMO
Open Datasets | Yes | For our experiments, we use three datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and SVHN (Netzer et al., 2011).
Dataset Splits | Yes | We randomly split the original training sets for these datasets into a training set and a validation set in a 9:1 ratio. (A split sketch follows the table.)
Hardware Specification | Yes | To ensure a fair comparison, all methods were integrated into a universal training framework, and each test was performed on a single NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions using the stochastic gradient descent (SGD) optimization algorithm and refers to standard adversarial training procedures, but it does not specify version numbers for any software libraries, programming languages, or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | The model was trained for a total of 200 epochs using the stochastic gradient descent (SGD) optimization algorithm with a momentum of 0.9, a weight decay of 5 × 10^-4, and an initial learning rate of 0.1. For standard AT, we reduced the learning rate by a factor of 10 at the 100th and 150th epochs, respectively. (...) For PreAct ResNet-18, we forgot a fixed s = 3.5% of the parameters in the later layers (Block-3 and Block-4) of the architecture, while for the wider WideResNet-34-10, we forgot 5% as it has a larger capacity to memorize. Each forgetting step was followed by a relearning phase that lasted for er = 5 epochs. (...) For the consolidation step, we chose a decay rate of the stable model of c = 0.999. During the relearning phase, the stable model guides relearning through the regularization loss (LCR), and we chose regularization strengths of λ1 and λ2 equal to 1. (Sketches of this training setup and of the forget-relearn-consolidate cycle follow the table.)
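
The Dataset Splits row reports a 9:1 split of each original training set into training and validation subsets. Below is a minimal sketch of such a split using torchvision and torch.utils.data.random_split; the fixed seed and the ToTensor transform are illustrative assumptions, not details taken from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Load the original CIFAR-10 training set (the paper also uses CIFAR-100 and SVHN).
transform = transforms.ToTensor()  # assumption: minimal transform for illustration
full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)

# Split 9:1 into training and validation subsets, as reported in the paper.
n_val = len(full_train) // 10
n_train = len(full_train) - n_val
train_set, val_set = random_split(
    full_train, [n_train, n_val],
    generator=torch.Generator().manual_seed(0),  # assumption: the paper does not report a seed
)
print(len(train_set), len(val_set))  # 45000 5000
```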
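
The Experiment Setup row specifies SGD with momentum 0.9, weight decay 5 × 10^-4, an initial learning rate of 0.1, and a tenfold learning-rate drop at epochs 100 and 150 over 200 training epochs. The sketch below wires up that optimizer and schedule in PyTorch; the ResNet-18 backbone is only a stand-in, since the paper uses PreAct ResNet-18 and WideResNet-34-10.

```python
import torch
from torchvision.models import resnet18  # placeholder backbone, not the paper's architecture

model = resnet18(num_classes=10)

# SGD hyperparameters as quoted in the Experiment Setup row.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Reduce the learning rate by a factor of 10 at the 100th and 150th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one epoch of (adversarial) training on the 9:1 training split would go here ...
    scheduler.step()
```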
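
The Pseudocode row points to Algorithm 1 (Adversarial Training with FOMO), and the Experiment Setup row quotes its key hyperparameters: forget a fraction s of the parameters in the later layers, relearn for er = 5 epochs, consolidate a stable model with decay c = 0.999, and weight the loss terms with λ1 = λ2 = 1. The following is a hedged sketch of that forget-relearn-consolidate cycle built only from those quoted values; the random re-initialization, the KL-based consistency term, and helper names such as forget_fraction and attack are assumptions rather than the authors' implementation, which is available at https://github.com/NeurAI-Lab/FOMO.

```python
import copy
import torch
import torch.nn.functional as F


def forget_fraction(module, s):
    """Randomly re-initialize a fraction s of the weights in `module` (assumed forgetting step)."""
    with torch.no_grad():
        for p in module.parameters():
            mask = (torch.rand_like(p) < s).float()      # select a fraction s of entries to forget
            p.mul_(1.0 - mask)                           # drop the selected entries
            p.add_(mask * torch.randn_like(p) * 0.01)    # small random re-initialization (assumption)


@torch.no_grad()
def ema_update(stable, plastic, c=0.999):
    """Consolidate the stable model as an exponential moving average of the plastic model."""
    for ps, pp in zip(stable.parameters(), plastic.parameters()):
        ps.mul_(c).add_(pp, alpha=1.0 - c)


def fomo_train(plastic, loader, optimizer, attack, later_layers, *,
               epochs=200, er=5, s=0.035, c=0.999, lam1=1.0, lam2=1.0):
    """Hedged sketch of adversarial training with forget/relearn/consolidate cycles."""
    stable = copy.deepcopy(plastic)                      # EMA "stable" model
    for epoch in range(epochs):
        if epoch > 0 and epoch % er == 0:
            for layer in later_layers:                   # e.g. Block-3 and Block-4 of PreAct ResNet-18
                forget_fraction(layer, s)
        for x, y in loader:
            x_adv = attack(plastic, x, y)                # e.g. a PGD attack (not implemented here)
            logits = plastic(x_adv)
            ce = F.cross_entropy(logits, y)
            with torch.no_grad():
                stable_logits = stable(x_adv)
            # Consistency regularization toward the stable model (assumed KL form).
            cr = F.kl_div(F.log_softmax(logits, dim=1),
                          F.softmax(stable_logits, dim=1), reduction="batchmean")
            loss = lam1 * ce + lam2 * cr
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ema_update(stable, plastic, c)
    return stable
```

In this sketch the stable model is consolidated after every optimizer step and the optimizer's momentum buffers are left untouched by the forgetting step; both are simplifying assumptions, and the released repository should be treated as the reference for the exact schedule.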