Optimizing Information-theoretical Generalization Bound via Anisotropic Noise of SGLD
Authors: Bohan Wang, Huishuai Zhang, Jieyu Zhang, Qi Meng, Wei Chen, Tie-Yan Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that, under a constraint that guarantees low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance when both the prior and the posterior are jointly optimized. This validates that the optimal noise is quite close to the empirical gradient covariance. Technically, we develop a new information-theoretical bound that enables such an optimization analysis, and we then apply matrix analysis to derive the form of the optimal noise covariance. The presented constraint and results are validated by empirical observations. (A hedged sketch of this noise scheme appears after the table.) |
| Researcher Affiliation | Collaboration | Bohan Wang (University of Science & Technology of China; Microsoft Research Asia); Huishuai Zhang (Microsoft Research Asia); Jieyu Zhang (University of Washington; Microsoft Research Asia); Qi Meng (Microsoft Research Asia); Wei Chen (Microsoft Research Asia); Tie-Yan Liu (Microsoft Research Asia) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code for the described methodology or links to code repositories. |
| Open Datasets | Yes | We adopt the setting in [40], where a four-layer neural network is used for the Fashion-MNIST classification problem. |
| Dataset Splits | Yes | We use a training dataset of 10000 samples and the full testing dataset of 10000 samples for evaluation. |
| Hardware Specification | Yes | The experiments are run on a single NVIDIA Tesla V100 GPU. |
| Software Dependencies | Yes | The code is implemented in PyTorch 1.7.0 and Python 3.8.3. |
| Experiment Setup | Yes | We adopt the setting in [40], where a four-layer neural network with 11330 parameters is used for the classification task on Fashion-MNIST, except that we use 10000 training samples instead of the 1200 used in [40]. The learning rate η_t is set to 0.001 and the mini-batch size to 128. Detailed settings of the experiments are deferred to Appendix F. (A minimal setup sketch follows the table.) |
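The claim quoted in the Research Type row, that the optimal noise covariance equals the square root of the expected gradient covariance, can be illustrated with a short NumPy sketch. Everything below is an illustration under assumptions, not the authors' code: the function name `anisotropic_sgld_step`, the per-sample-gradient input, and the step-size scaling are placeholders, and the expectation over the data distribution is reduced to a single mini-batch estimate.

```python
import numpy as np

def anisotropic_sgld_step(theta, per_sample_grads, lr=1e-3, rng=None):
    """One SGLD-style update whose noise covariance is C^{1/2}, the
    matrix square root of the empirical gradient covariance C.

    Hypothetical sketch; scaling constants from the paper are omitted.
    theta: parameter vector of shape (d,)
    per_sample_grads: per-sample gradients of shape (n, d)
    """
    rng = np.random.default_rng() if rng is None else rng
    g_mean = per_sample_grads.mean(axis=0)                 # mini-batch gradient
    centered = per_sample_grads - g_mean
    C = centered.T @ centered / per_sample_grads.shape[0]  # empirical gradient covariance
    # To draw noise with covariance C^{1/2}, multiply a standard normal
    # by C^{1/4}, since C^{1/4} (C^{1/4})^T = C^{1/2}.
    eigvals, eigvecs = np.linalg.eigh(C)                   # C is symmetric PSD
    quarter_root = eigvecs @ np.diag(np.clip(eigvals, 0.0, None) ** 0.25) @ eigvecs.T
    noise = quarter_root @ rng.standard_normal(theta.shape)
    return theta - lr * g_mean + np.sqrt(lr) * noise
```

Forming and factoring the d × d covariance is only tractable for small models, which fits the 11330-parameter network used in the paper's experiments.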
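Similarly, the Experiment Setup row can be read against a minimal PyTorch sketch (PyTorch 1.7.0 is listed under Software Dependencies). The hidden widths below are placeholders, since the exact 11330-parameter architecture is only specified in the paper's Appendix F, and plain SGD stands in for the paper's SGLD variant; only the dataset, the 10000-sample training subset, the learning rate of 0.001, and the mini-batch size of 128 are taken from the table.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Hypothetical sketch of the reported setup. Hidden widths are
# illustrative and do not reproduce the paper's 11330-parameter count.
train_set = datasets.FashionMNIST("data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_subset = Subset(train_set, range(10000))   # 10000 training samples, per the table
train_loader = DataLoader(train_subset, batch_size=128, shuffle=True)

model = nn.Sequential(                            # four-layer fully connected network
    nn.Flatten(),
    nn.Linear(28 * 28, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # SGD stands in for SGLD here
loss_fn = nn.CrossEntropyLoss()

for x, y in train_loader:                         # one illustrative training step
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    break
```

Swapping the plain SGD step for the anisotropic update sketched above would recover the paper's training loop in spirit, though not its exact constants or architecture.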