Improving Generalization and Convergence by Enhancing Implicit Regularization

Authors: Mingze Wang, Jinbo Wang, Haotian He, Zilin Wang, Guanhua Huang, Feiyu Xiong, Zhiyu Li, Weinan E, Lei Wu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a 2× speed-up compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext.
Researcher Affiliation | Collaboration | 1. School of Mathematical Sciences, Peking University; 2. Center for Machine Learning Research, Peking University; 3. Institute for Advanced Algorithms Research (Shanghai); 4. AI for Science Institute; 5. School of Data Science, University of Science and Technology of China; 6. ByteDance Research
Pseudocode | Yes | Algorithm 1: Practical IRE (a practical framework for implementing IRE)
Open Source Code | Yes | The code is available at https://github.com/wmz9/IRE-algorithm-framework.
Open Datasets | Yes | Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a 2× speed-up compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext.
Dataset Splits | No | The paper mentions training data and validation loss but does not explicitly provide the training/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction.
Hardware Specification | Yes | In this section, all experiments were conducted using a single A800 GPU... the experiments on CIFAR-10/CIFAR-100 were conducted using a single A800 GPU, and the experiments on ImageNet were conducted using 4 A800 GPUs... The experiments in this section are conducted on 1 A800... The experiments are conducted on 4 H800.
Software Dependencies | No | The paper mentions software like PyTorch and libraries like HuggingFace's Llama code, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | The mini-batch size is set to 128, the weight decay is set to 5e-4, and the ρ in SAM is set to 0.05... a fixed lr 0.1 is used... a step-decayed lr schedule is employed, starting at 0.1 and reducing the lr by a factor of 5 at epochs 20, 50, 80... cosine learning rate decay is adopted with an initial lr of 0.1... the momentum is set to 0.9, the batch size is set to 128, and the weight decay is set to 5e-4; for SAM, ρ is set to 0.05 for CIFAR-10 and 0.1 for CIFAR-100. For SGD-IRE/SAM-IRE, we fix K = 10, and tune the hyperparameters γ and κ via a grid search over γ ∈ {0.99, 0.9, 0.8} and κ ∈ {1, 2}.
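
Illustrative sketch (not from the paper): the setup above tunes the IRE-specific hyperparameters K (how often the flat-direction estimate is refreshed), γ (the fraction of directions treated as flat), and κ (how strongly the flat component of the update is amplified). The snippet below is a minimal, hypothetical PyTorch sketch of one SGD-IRE-style step under those assumptions; the function name ire_sgd_step and the use of per-coordinate gradient magnitude as a cheap flatness proxy are illustrative choices, not the paper's exact Algorithm 1 (see the linked repository for the reference implementation).

```python
import torch

def ire_sgd_step(params, masks, step, lr=0.1, kappa=1.0, gamma=0.9, K=10):
    """One hypothetical IRE-style SGD step: amplify the update along 'flat' coordinates.

    params: iterable of tensors with .grad populated (call loss.backward() first)
    masks:  dict reused across calls; flat-direction masks are refreshed every K steps
    """
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        g = p.grad.detach()
        if step % K == 0 or i not in masks:
            # Assumption: treat the gamma fraction of coordinates with the smallest
            # gradient magnitude as the "flat" directions.
            threshold = torch.quantile(g.abs().flatten().float(), gamma)
            masks[i] = (g.abs() <= threshold).to(g.dtype)
        # Base SGD step plus a kappa-amplified component along the flat directions.
        update = g + kappa * masks[i] * g
        p.data.add_(update, alpha=-lr)
```

A typical (hypothetical) usage keeps a persistent masks = {} and calls ire_sgd_step(list(model.parameters()), masks, step) after each backward pass; setting kappa = 0 recovers plain SGD, which matches the role of κ as the enhancement strength in the grid search above.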