Anticorrelated Noise Injection for Improved Generalization
Authors: Antonio Orvieto, Hans Kersting, Frank Proske, Francis Bach, Aurelien Lucchi
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive set of experiments ranging from shallow neural networks to deep architectures with real data (e.g. CIFAR-10) and we demonstrate that Anti-PGD, indeed, reliably finds minima that are both flatter and generalize better than the ones found by standard GD or PGD. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, ETH Zurich, Switzerland; ²INRIA, École Normale Supérieure, PSL Research University, Paris, France; ³Department of Mathematics, University of Oslo, Norway; ⁴Department of Mathematics and Computer Science, University of Basel, Switzerland. |
| Pseudocode | No | The paper describes the algorithms via update equations (Eqs. 1 and 2 for PGD and Anti-PGD) but does not include any structured pseudocode blocks or algorithm listings. (A minimal NumPy sketch of these two updates appears after the table.) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of source code for the methodology described. |
| Open Datasets | Yes | We conduct an extensive set of experiments ranging from shallow neural networks to deep architectures with real data (e.g. CIFAR-10) |
| Dataset Splits | No | The paper mentions 'train loss' and 'test loss' but does not explicitly specify the division of data into training, validation, and test sets, nor does it provide percentages or sample counts for these splits. |
| Hardware Specification | No | To approximate full-batch gradient descent we use a very large batch size of 7500 samples (i.e. until saturation of 5 GPUs). While '5 GPUs' is mentioned, no specific GPU models (e.g., NVIDIA A100, Tesla V100) or other hardware specifications (CPU, RAM) are provided. |
| Software Dependencies | No | The paper mentions using 'a simple SGD optimizer (with momentum 0.9)' and a 'ResNet18-like architecture' but does not specify software names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x) that would be needed for reproducibility. |
| Experiment Setup | Yes | Here, to keep things simple, we train with a simple SGD optimizer (with momentum 0.9), and select a learning rate of 0.05. To approximate full-batch gradient descent we use a very large batch size of 7500 samples (i.e. until saturation of 5 GPUs). For SGD, we instead select a batch size of 128, and keep the learning rate at 0.05. For convergence of the test accuracy and the Hessian trace, it is convenient to kill the noise injection after 250 epochs so that the optimizer converges to the nearest minimum. In this experiment, we keep the parameter settings as in the last paragraph, but instead consider injecting noise only after 75 epochs. (A hypothetical PyTorch reconstruction of these settings appears after the table.) |
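
The "Pseudocode" row above notes that PGD and Anti-PGD are specified only as update equations (Eqs. 1 and 2 in the paper): PGD adds an i.i.d. perturbation at each step, while Anti-PGD adds the increment of that sequence, making consecutive perturbations anticorrelated. The following minimal NumPy sketch restates those two updates; the function names, default hyperparameters, and toy quadratic loss are illustrative choices, not taken from the paper.

```python
import numpy as np

def pgd(x0, grad, lr=0.05, sigma=0.1, n_steps=1000, rng=None):
    """Perturbed GD (Eq. 1): x_{k+1} = x_k - lr*grad(x_k) + xi_{k+1},
    with i.i.d. zero-mean Gaussian perturbations xi_k."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad(x) + rng.normal(0.0, sigma, size=x.shape)
    return x

def anti_pgd(x0, grad, lr=0.05, sigma=0.1, n_steps=1000, rng=None):
    """Anti-PGD (Eq. 2): inject the increment xi_{k+1} - xi_k of the same
    i.i.d. sequence, so consecutive perturbations are anticorrelated."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    xi_prev = np.zeros_like(x)
    for _ in range(n_steps):
        xi = rng.normal(0.0, sigma, size=x.shape)
        x = x - lr * grad(x) + (xi - xi_prev)
        xi_prev = xi
    return x

# Toy usage on the quadratic loss L(x) = 0.5 * ||x||^2 (illustrative only).
grad = lambda x: x
print(anti_pgd(np.ones(5), grad))
```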
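
For the "Experiment Setup" row, the quoted hyperparameters (SGD with momentum 0.9, learning rate 0.05, batch size 7500 to approximate full-batch GD or 128 for SGD, noise injection switched off after 250 epochs) can be assembled into a training loop. Since the paper releases no code, the sketch below is a hypothetical PyTorch reconstruction: the data pipeline, the noise scale SIGMA, the total epoch count, and the exact placement of the parameter perturbation are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor())
# Quoted: batch size 7500 to approximate full-batch GD (128 for the SGD runs).
loader = torch.utils.data.DataLoader(train_set, batch_size=7500, shuffle=True)

model = torchvision.models.resnet18(num_classes=10).to(device)  # "ResNet18-like"
# Quoted: SGD optimizer with momentum 0.9 and learning rate 0.05.
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

SIGMA = 0.01           # noise scale -- not given in the quoted excerpt, placeholder
NOISE_OFF_EPOCH = 250  # "kill the noise injection after 250 epochs"
prev_noise = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

for epoch in range(300):                      # total epoch count is an assumption
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
        if epoch < NOISE_OFF_EPOCH:           # anticorrelated perturbation, as above
            with torch.no_grad():
                for n, p in model.named_parameters():
                    xi = SIGMA * torch.randn_like(p)
                    p.add_(xi - prev_noise[n])
                    prev_noise[n] = xi
```

The delayed-injection variant quoted in the same row ("injecting noise only after 75 epochs") would simply change the condition to `75 <= epoch < NOISE_OFF_EPOCH`.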