The Effectiveness of Random Forgetting for Robust Generalization
Authors: Vijaya Raghavan T Ramkumar, Bahram Zonooz, Elahe Arani
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on benchmark datasets and adversarial attacks show that FOMO alleviates robust overfitting by significantly reducing the gap between the best and last robust test accuracy while improving the state-of-the-art robustness. Furthermore, FOMO provides a better trade-off between standard and robust accuracy, outperforming baseline adversarial methods. |
| Researcher Affiliation | Collaboration | ¹Eindhoven University of Technology, ²TomTom, ³Wayve |
| Pseudocode | Yes | Algorithm 1 Adversarial Training with FOMO |
| Open Source Code | Yes | Code is available at https://github.com/NeurAI-Lab/FOMO. |
| Open Datasets | Yes | For our experiments, we use three datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and SVHN (Netzer et al., 2011). |
| Dataset Splits | Yes | We randomly split the original training sets for these datasets into a training set and a validation set in a 9:1 ratio. (A minimal split sketch is given after this table.) |
| Hardware Specification | Yes | To ensure a fair comparison, all methods were integrated into a universal training framework, and each test was performed on a single NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using the stochastic gradient descent (SGD) optimization algorithm and refers to standard adversarial training procedures, but it does not specify version numbers for any software libraries, programming languages, or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | The model was trained for a total of 200 epochs using the stochastic gradient descent (SGD) optimization algorithm with a momentum of 0.9, a weight decay of 5×10⁻⁴, and an initial learning rate of 0.1. For standard AT, we reduced the learning rate by a factor of 10 at the 100th and 150th epochs. (...) For PreActResNet-18, we forgot a fixed s = 3.5% of the parameters in the later layers (Block-3 and Block-4) of the architecture, while for the wider WideResNet-34-10, we forgot 5% as it has a larger capacity to memorize. Each forgetting step was followed by a relearning phase that lasted for er = 5 epochs. (...) For the consolidation step, we chose a decay rate of c = 0.999 for the stable model. During the relearning phase, the stable model guides relearning through the regularization loss (L_CR), and we chose regularization strengths λ1 and λ2 equal to 1. (A hedged sketch of this configuration follows the table.) |
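The 9:1 train/validation split quoted in the Dataset Splits row can be reproduced in a few lines. The use of `torchvision`, `random_split`, and a fixed seed is an illustrative assumption; the paper does not state its data-loading framework.

```python
# Minimal sketch of the 9:1 train/validation split (framework choice assumed).
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())

n_total = len(full_train)   # 50,000 training images for CIFAR-10
n_val = n_total // 10       # 10% held out for validation
n_train = n_total - n_val   # remaining 90% used for training

train_set, val_set = random_split(
    full_train, [n_train, n_val],
    generator=torch.Generator().manual_seed(0),  # fixed seed (assumed, for reproducibility)
)
```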
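The Experiment Setup row translates into roughly the following sketch of the optimizer schedule, the random forgetting step, and the EMA consolidation of the stable model. Function names (`reinit_fraction`, `ema_update`), the use of a standard ResNet-18 as a stand-in for PreActResNet-18, and the re-initialization scheme for forgotten weights are all illustrative assumptions rather than the authors' implementation; the exact code is in the linked repository.

```python
# Hedged sketch of the reported training configuration and FOMO-style
# forget/consolidate steps (helper names and re-init scheme are assumptions).
import copy
import torch
from torchvision.models import resnet18  # stand-in for PreActResNet-18 (assumption)

model = resnet18(num_classes=10)
stable_model = copy.deepcopy(model)      # EMA "stable" model used for consolidation

# SGD with momentum 0.9, weight decay 5e-4, initial lr 0.1,
# decayed by 10x at epochs 100 and 150, as reported in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)

def reinit_fraction(modules, fraction=0.035):
    """Randomly re-initialize ("forget") a fraction of the weights in the given modules."""
    with torch.no_grad():
        for m in modules:
            for p in m.parameters():
                mask = torch.rand_like(p) < fraction           # entries to forget
                fresh = torch.empty_like(p).normal_(std=0.01)  # re-init scheme (assumed)
                p.copy_(torch.where(mask, fresh, p))

def ema_update(stable, plastic, decay=0.999):
    """Consolidate the plastic model into the stable model (EMA, decay 0.999)."""
    with torch.no_grad():
        for ps, pp in zip(stable.parameters(), plastic.parameters()):
            ps.mul_(decay).add_(pp, alpha=1.0 - decay)

# "Later layers" mapped to the last two residual stages of ResNet-18,
# as an analogue of Block-3 / Block-4 in PreActResNet-18 (assumption).
later_layers = [model.layer3, model.layer4]

# One forget/relearn cycle (the adversarial training loop itself is omitted):
reinit_fraction(later_layers, fraction=0.035)  # forget s = 3.5% of the parameters
# ... relearn for er = 5 epochs, regularizing toward stable_model with λ1 = λ2 = 1 ...
ema_update(stable_model, model, decay=0.999)   # consolidation step
```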