Optimizing Information-theoretical Generalization Bound via Anisotropic Noise of SGLD

Authors: Bohan Wang, Huishuai Zhang, Jieyu Zhang, Qi Meng, Wei Chen, Tie-Yan Liu

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We prove that, with a constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance if both the prior and the posterior are jointly optimized. This validates that the optimal noise is quite close to the empirical gradient covariance. Technically, we develop a new information-theoretical bound that enables such an optimization analysis. We then apply matrix analysis to derive the form of the optimal noise covariance. The presented constraint and results are validated by empirical observations. (An illustrative sketch of this noise form is given after the table.) |
| Researcher Affiliation | Collaboration | Bohan Wang (University of Science & Technology of China; Microsoft Research Asia), Huishuai Zhang (Microsoft Research Asia), Jieyu Zhang (University of Washington; Microsoft Research Asia), Qi Meng (Microsoft Research Asia), Wei Chen (Microsoft Research Asia), Tie-Yan Liu (Microsoft Research Asia) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code for the described methodology, nor links to code repositories. |
| Open Datasets | Yes | We adopt the setting in [40], where a four-layer neural network is used for the Fashion-MNIST classification problem. |
| Dataset Splits | Yes | We use the full training dataset with 10000 samples and the full testing dataset with 10000 samples for evaluation. |
| Hardware Specification | Yes | The experiments are run on a single NVIDIA Tesla V100 GPU. |
| Software Dependencies | Yes | The code is implemented in PyTorch 1.7.0 and Python 3.8.3. |
| Experiment Setup | Yes | We adopt the setting in [40], where a four-layer neural network with 11330 parameters is used to conduct the classification task on Fashion-MNIST, except that we use 10000 training samples instead of the 1200 used in [40]. We defer detailed settings of the experiments to Appendix F. The learning rate η_t is set to 0.001. The mini-batch size is set to 128. |
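The central object in the paper's result, an SGLD-style update whose injected noise covariance equals the square root of the gradient covariance, can be made concrete with a short sketch. The code below is an illustrative reconstruction, not the authors' implementation: the tiny model, synthetic data, step count, and the noise scale `sigma` are assumptions chosen so that the full d × d covariance and its matrix square roots stay cheap, and it targets a recent PyTorch release rather than the 1.7.0 listed above. Only the learning rate (0.001) and mini-batch size (128) are taken from the reported setup; the paper's actual experiments use an 11330-parameter four-layer network on Fashion-MNIST.

```python
# Illustrative sketch (not the authors' code) of an SGLD-style step whose noise
# covariance is the square root of the empirical mini-batch gradient covariance.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny model so the d x d gradient covariance and its square roots are cheap.
model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))
params = list(model.parameters())
d = sum(p.numel() for p in params)

lr, batch_size = 1e-3, 128   # learning rate and batch size follow the reported setup
sigma = 1e-2                 # assumed noise scale, for illustration only

# Synthetic data standing in for Fashion-MNIST.
X, y = torch.randn(1024, 20), torch.randint(0, 2, (1024,))

def per_example_grads(xb, yb):
    """Return a (batch, d) matrix of flattened per-example loss gradients."""
    rows = []
    for i in range(xb.size(0)):
        loss = F.cross_entropy(model(xb[i:i + 1]), yb[i:i + 1])
        grads = torch.autograd.grad(loss, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

def psd_sqrt(mat):
    """Symmetric PSD matrix square root via eigendecomposition."""
    evals, evecs = torch.linalg.eigh(mat)
    return evecs @ torch.diag(evals.clamp_min(0.0).sqrt()) @ evecs.T

for step in range(10):
    idx = torch.randint(0, X.size(0), (batch_size,))
    G = per_example_grads(X[idx], y[idx])        # per-example gradients, (batch, d)
    g = G.mean(dim=0)                            # mini-batch gradient
    C = (G - g).T @ (G - g) / G.size(0)          # empirical gradient covariance
    # Target noise covariance is C^{1/2}; sampling it means multiplying a
    # standard Gaussian by C^{1/4}, a square root of C^{1/2}.
    noise = psd_sqrt(psd_sqrt(C)) @ torch.randn(d)
    update = -lr * g + sigma * noise             # SGLD-style step with anisotropic noise
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p.add_(update[offset:offset + n].view_as(p))
            offset += n
```

For realistic model sizes the explicit eigendecomposition above would be prohibitive; it is used here only to make the covariance-square-root structure of the optimal noise explicit.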