Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

Authors: Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W. Mahoney, Yaoqing Yang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement TempBalance on CIFAR10, CIFAR100, SVHN, and Tiny ImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers. In this section, we give full details of the experimental setup (Section 4.1), compare our method TempBalance to a few baselines (Section 4.2), and then perform ablation studies (Section 4.3) on varied initial learning rates, model widths, HT-SR layer-wise metrics, and PL fitting methods.
Researcher Affiliation | Collaboration | Yefan Zhou*1, Tianyu Pang*2, Keqin Liu2, Charles H. Martin3, Michael W. Mahoney4, and Yaoqing Yang1 (1: Department of Computer Science, Dartmouth College; 2: National Center for Applied Mathematics and Department of Mathematics, Nanjing University; 3: Calculation Consulting; 4: ICSI, LBNL, and Department of Statistics, University of California at Berkeley)
Pseudocode | Yes | Algorithm 1: TempBalance
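For orientation, below is a minimal PyTorch sketch of the idea behind Algorithm 1: estimate each layer's heavy-tail exponent from the empirical spectral density of its weights, then assign per-layer learning rates accordingly. The Hill estimator variant, the tail fraction (top half of the spectrum), and the linear map into [0.5η, 1.5η] are illustrative assumptions; the exact estimator and mapping used in the paper live in the linked repository.

```python
import torch
import torch.nn as nn

def hill_alpha(weight: torch.Tensor, k_frac: float = 0.5, eps: float = 1e-12) -> float:
    """Estimate a layer's heavy-tail exponent (PL_Alpha_Hill) from the
    spectrum of its weight matrix via a Hill-style estimator.
    k_frac (fraction of the spectrum treated as the tail) is an
    illustrative choice, not the paper's tuned value."""
    W = weight.detach().flatten(1)            # conv kernels -> 2-D matrix
    eigs = torch.linalg.svdvals(W) ** 2       # eigenvalues of W W^T
    eigs = torch.sort(eigs, descending=True).values.clamp_min(eps)
    k = max(1, int(k_frac * eigs.numel()))
    denom = torch.log(eigs[:k] / eigs[k - 1]).sum().clamp_min(eps)
    return (1.0 + k / denom).item()

def tempbalance_lrs(model: nn.Module, base_lr: float,
                    lo: float = 0.5, hi: float = 1.5) -> dict:
    """Linearly map each layer's alpha into [lo*base_lr, hi*base_lr]:
    heavier-tailed layers (small alpha, already well trained) are
    'cooled' with smaller learning rates. The [0.5, 1.5] bounds are an
    assumed stand-in for the paper's linear mapping function."""
    alphas = {name: hill_alpha(m.weight)
              for name, m in model.named_modules()
              if isinstance(m, (nn.Linear, nn.Conv2d))}
    a_min, a_max = min(alphas.values()), max(alphas.values())
    span = max(a_max - a_min, 1e-12)
    return {name: base_lr * (lo + (hi - lo) * (a - a_min) / span)
            for name, a in alphas.items()}
```

In training, these per-layer rates would replace the optimizer's single global rate once per epoch, e.g., by giving each layer its own parameter group and updating its lr from the returned dictionary.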
Open Source Code | Yes | Our code is open-sourced: https://github.com/YefanZhou/TempBalance.
Open Datasets | Yes | Datasets. We consider CIFAR100, CIFAR10, SVHN, and Tiny ImageNet (TIN) [72-75]. CIFAR100 consists of 50K pictures for training and 10K pictures for testing with 100 categories. CIFAR10 consists of 50K pictures for training and 10K pictures for testing with 10 categories. SVHN consists of around 73K pictures for training and around 26K pictures for testing with 10 categories. Tiny ImageNet consists of 100K pictures for training and 10K pictures for testing with 200 classes. The datasets are accompanied by citations [72-75].
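All four datasets are available through torchvision. A minimal loading sketch for CIFAR-100 follows; the crop/flip augmentations are an assumption typical for CIFAR training, not quoted from the paper:

```python
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([T.RandomCrop(32, padding=4),   # assumed augmentation
                      T.RandomHorizontalFlip(),
                      T.ToTensor()])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR100(root="./data", train=False,
                                         download=True, transform=T.ToTensor())
print(len(train_set), len(test_set))   # 50000 10000, matching the paper
```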
Dataset Splits | No | The paper specifies training and testing sets but does not mention a validation set or explain how one was used for hyperparameter tuning or early stopping. For example, it states: "CIFAR100 consists of 50K pictures for training and 10K pictures for testing..."
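Because no validation protocol is given, a reader reproducing the work would have to choose one themselves. A hypothetical 45K/5K split of the 50K CIFAR training pictures (the split sizes and seed are arbitrary, not from the paper):

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import random_split

full_train = torchvision.datasets.CIFAR100(root="./data", train=True,
                                           download=True,
                                           transform=T.ToTensor())
# Hypothetical 45K/5K split; the paper does not describe a validation set.
train_subset, val_subset = random_split(
    full_train, [45_000, 5_000],
    generator=torch.Generator().manual_seed(42))
```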
Hardware Specification | Yes | The test platform was one Quadro RTX 6000 GPU with an Intel Xeon Gold 6248 CPU.
Software Dependencies | No | The paper mentions "Our code is open-sourced" with a link, implying the software environment is available in the repository, but it does not list specific software dependencies with version numbers in the text, e.g., "PyTorch 1.x" or "CUDA 11.x".
Experiment Setup | Yes | The momentum and weight decay are 0.9 and 5 × 10⁻⁴, respectively, which are both standard choices. We grid search the optimal initial (base) learning rate η0 for the CAL baseline, using the grid {0.05, 0.1, 0.15} for ResNet and {0.025, 0.05, 0.1} for VGG. For SNR, we grid search the optimal regularization coefficient λ_sr... The hyperparameter values used across all experiments can be found in Appendix D. Table 1 reports the details of experiments shown in Figure 3. ... The default optimizer is SGD, trained with batch size 128, 200 training epochs, weight decay 5e-4, and momentum 0.9.
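The quoted hyperparameters translate directly into a PyTorch optimizer. A sketch follows, assuming CAL denotes the cosine-annealing learning-rate baseline and taking ResNet-18 on CIFAR-100 as a stand-in architecture; only the SGD settings (lr grid, momentum, weight decay, epochs, batch size) are quoted from the paper:

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=100)   # assumed architecture
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,                    # from the ResNet grid {0.05, 0.1, 0.15}
                            momentum=0.9,              # quoted from the paper
                            weight_decay=5e-4)         # quoted from the paper
# Assumed CAL baseline: cosine annealing over the 200 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one epoch of training with batch size 128 ...
    scheduler.step()
```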