Spectral Evolution and Invariance in Linear-width Neural Networks
Authors: Zhichao Wang, Andrew Engel, Anand D Sarwate, Ioana Dumitriu, Tony Chiang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that the spectra of the weights in this high-dimensional regime are invariant when trained by gradient descent with small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. We exhibit different spectral properties, such as an invariant bulk, spikes, and heavy-tailed distributions, from a two-layer neural network under different training strategies, and then correlate them with feature learning. Analogous phenomena also appear when we train conventional neural networks on real-world data. (A minimal numerical sketch of this before/after spectral comparison appears below the table.) |
| Researcher Affiliation | Collaboration | Zhichao Wang (University of California San Diego) zhw036@ucsd.edu; Andrew Engel (Pacific Northwest National Laboratory) andrew.engel@pnnl.gov; Anand Sarwate (Rutgers, The State University of New Jersey) ads221@soe.rutgers.edu; Ioana Dumitriu (University of California San Diego) idumitriu@ucsd.edu; Tony Chiang (Pacific Northwest National Laboratory; University of Washington; University of Texas at El Paso) tony.chiang@pnnl.gov |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented. |
| Open Source Code | No | The paper does not provide a direct statement about releasing its own source code for the described methodology or a link to a repository. |
| Open Datasets | Yes | First, we show the spectra of KNTK before and after training for binary classification on CIFAR-2 through small CNNs in Figure 4. We also investigate the spectral properties of the pre-trained model BERT from [26], with fine-tuning on the Sentiment140 dataset of tweets from [36]. (A sketch of constructing the CIFAR-2 subset appears below the table.) |
| Dataset Splits | No | The paper mentions training and testing data but does not provide explicit percentages, counts, or specific methodology for dataset splits. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments. |
| Software Dependencies | No | The paper mentions software like BERT, Adam, SGD, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Four models share the same architecture (n = 2000, h = 1500, d = 1000, and σ a normalized tanh) but differ in their initial learning rates and optimizers, as listed in Table 1. The training label noise is σ_ε = 0.3 and the teacher model is defined by Eq. (9) with σ a normalized softplus and τ = 0.2. For fine-tuning, the learning rate is 0.003, the batch size is 64, and the momentum is 0.8. (Sketches wiring up these hyperparameters appear below the table.) |
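
The invariance claim quoted in the Research Type row can be checked numerically by comparing the singular-value spectrum of the first-layer weights before and after training. The sketch below is not the authors' code: the Gaussian data model, the fixed second layer, the plain tanh activation, and the learning rate are illustrative assumptions; only the dimensions n = 2000, h = 1500, d = 1000 come from Table 1.

```python
# Minimal sketch (not the authors' code) of the bulk-spectrum comparison:
# train a two-layer network by full-batch gradient descent with a small
# constant learning rate and compare the first-layer singular values
# before and after training.
import numpy as np
import torch

torch.manual_seed(0)
n, d, h = 2000, 1000, 1500        # samples, input dim, hidden width (Table 1)
lr, steps = 0.01, 200             # "small constant learning rate": assumed value

X = torch.randn(n, d) / d ** 0.5  # isotropic Gaussian inputs (assumption)
y = torch.randn(n)                # placeholder targets; the paper uses a teacher model

W = torch.randn(h, d, requires_grad=True)  # first layer, N(0, 1) entries
a = torch.randn(h) / h ** 0.5              # second layer, held fixed (assumption)

def singular_values(weight):
    # Bulk spectrum of W / sqrt(d), the object whose invariance is studied.
    return torch.linalg.svdvals(weight.detach() / d ** 0.5).numpy()

svals_before = singular_values(W)
for _ in range(steps):
    pred = torch.tanh(X @ W.T) @ a   # plain tanh stands in for the normalized tanh
    loss = torch.mean((pred - y) ** 2)
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad             # full-batch gradient descent step
    W.grad = None
svals_after = singular_values(W)

# Invariance of the bulk: the two empirical distributions should nearly coincide.
print(np.histogram(svals_before, bins=20)[0])
print(np.histogram(svals_after, bins=20)[0])
```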
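
The Open Datasets row refers to CIFAR-2, a two-class subset of CIFAR-10 used for binary classification. A minimal way to build such a subset with torchvision is sketched below; the particular class pair is an assumption, since the paper does not state which two classes are used.

```python
# Sketch of building a "CIFAR-2" binary subset from CIFAR-10 with torchvision.
# The class pair (here 0 = airplane vs. 1 = automobile) is an assumption.
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transform)
keep_classes = (0, 1)  # assumed pair of CIFAR-10 classes; labels are already 0/1
idx = [i for i, t in enumerate(full_train.targets) if t in keep_classes]
cifar2_train = Subset(full_train, idx)

print(len(cifar2_train), "images in the two-class subset")
```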
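
For the fine-tuning setup in the Experiment Setup row (learning rate 0.003, batch size 64, momentum 0.8), the sketch below shows how those values might map onto a standard SGD configuration. The bert-base-uncased checkpoint and the Hugging Face transformers API are assumptions; the paper only states that BERT from [26] is fine-tuned on Sentiment140.

```python
# Hedged sketch of the reported fine-tuning hyperparameters:
# learning rate 0.003, batch size 64, momentum 0.8.
# The checkpoint name and library choice are assumptions, not the paper's code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # Sentiment140 is a binary sentiment task
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# SGD with the momentum value quoted in the table.
optimizer = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.8)

# A DataLoader over a tokenized Sentiment140 split (construction omitted) would use:
# loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
```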