Sharpness-Aware Minimization Activates the Interactive Teaching's Understanding and Optimization

Authors: Mingwei Xu, Xiaofeng Cao, Ivor W. Tsang

NeurIPS 2024

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental
LLM Response: "In this section, we will conduct experiments on two core baselines, co-teaching [16] and CNLCU [40]. For all datasets, we utilize a 9-layer CNN architecture [16] with dropout and batch normalization for the classification task. In co-teaching, for all datasets, we use the Adam optimizer with a momentum of 0.9, an initial learning rate of 0.001, and trained for 200 epochs."

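As a minimal PyTorch sketch of the quoted co-teaching configuration, the snippet below sets up two peer networks and their Adam optimizers. The `make_cnn9` helper is a hypothetical placeholder whose channel widths are illustrative rather than the exact 9-layer architecture of [16], and the stated Adam "momentum of 0.9" is read here as Adam's first-moment coefficient beta1.

```python
# Illustrative sketch only: placeholder 9-layer-style CNN and the quoted Adam settings.
import torch
import torch.nn as nn

def make_cnn9(num_classes: int = 10) -> nn.Module:
    # Placeholder for the 9-layer CNN of [16] (dropout + batch norm);
    # channel widths are illustrative, not the paper's exact architecture.
    return nn.Sequential(
        nn.Conv2d(3, 128, 3, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.01),
        nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.01),
        nn.MaxPool2d(2), nn.Dropout(0.25),
        nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.01),
        nn.MaxPool2d(2), nn.Dropout(0.25),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(256, num_classes),
    )

# Co-teaching trains two peer networks f and g.
model_f, model_g = make_cnn9(), make_cnn9()

# "Adam with a momentum of 0.9" is interpreted as Adam's beta1 = 0.9.
opt_f = torch.optim.Adam(model_f.parameters(), lr=0.001, betas=(0.9, 0.999))
opt_g = torch.optim.Adam(model_g.parameters(), lr=0.001, betas=(0.9, 0.999))
num_epochs = 200
```
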
Researcher Affiliation: Academia
LLM Response: "Mingwei Xu 1, Xiaofeng Cao 1, Ivor W. Tsang 2,3; 1 School of Artificial Intelligence, Jilin University, China; 2 CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore; 3 College of Computing and Data Science, Nanyang Technological University, Singapore"

Pseudocode: Yes
LLM Response: "Algorithm 1: Sharpness Reduction Interactive Teaching (SRIT). Input: initial network parameters θ_f^0 and θ_g^0, learning rate η, fixed parameter τ, iteration counts T_k and T_max, maximum iteration count N_max, pre-defined constant ρ. Output: updated network parameters θ_f and θ_g."

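Algorithm 1 itself is not reproduced in this summary; the sketch below only suggests what one SRIT-style mini-batch step could look like, combining co-teaching's small-loss sample exchange with a standard SAM ascent/descent update. Every function name (`small_loss_indices`, `sam_step`, `srit_batch`) and implementation detail is an assumption derived from the quoted inputs (θ_f, θ_g, τ, ρ), not the authors' code.

```python
# Hedged sketch of one SRIT-style mini-batch step (assumed, not the authors' implementation).
import torch
import torch.nn.functional as F

def small_loss_indices(logits, targets, remember_rate):
    # Co-teaching selection: keep the fraction R(T) of samples with the smallest loss.
    losses = F.cross_entropy(logits, targets, reduction="none")
    num_keep = max(1, int(remember_rate * len(targets)))
    return torch.argsort(losses)[:num_keep]

def sam_step(model, optimizer, x, y, rho=0.05):
    # One SAM update: ascend to the worst-case weights inside an L2 ball of radius rho,
    # then apply the optimizer step from the original weights using that gradient.
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm() for p in model.parameters() if p.grad is not None]))
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # ascent step along the sharpness direction
            perturbations.append((p, e))
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()  # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                      # restore the original weights
    optimizer.step()
    optimizer.zero_grad()

def srit_batch(model_f, model_g, opt_f, opt_g, x, y, remember_rate, rho=0.05):
    # Each network selects its small-loss samples, which are used to update its peer with SAM.
    with torch.no_grad():
        idx_f = small_loss_indices(model_f(x), y, remember_rate)  # chosen by f, used for g
        idx_g = small_loss_indices(model_g(x), y, remember_rate)  # chosen by g, used for f
    sam_step(model_f, opt_f, x[idx_g], y[idx_g], rho)
    sam_step(model_g, opt_g, x[idx_f], y[idx_f], rho)
```
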
Open Source Code: Yes
LLM Response: "The paper will make the code openly accessible for reproducing the experimental results."

Open Datasets: Yes
LLM Response: "Based on previous research [16, 44, 40], we conduct experiments on five widely used datasets to effectively demonstrate the efficacy of the co-teaching algorithm. These datasets include MNIST [23], FMNIST [41], CIFAR10 [21], SVHN [33], and CIFAR100 [21]."

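All five benchmarks are available through torchvision; the sketch below assumes a standard torchvision pipeline, with normalization and augmentation transforms omitted and an arbitrary `./data` download root.

```python
# Loading sketch for the five benchmark datasets via torchvision (transforms simplified).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
root = "./data"  # arbitrary download directory

train_sets = {
    "MNIST":    datasets.MNIST(root, train=True, download=True, transform=to_tensor),
    "FMNIST":   datasets.FashionMNIST(root, train=True, download=True, transform=to_tensor),
    "CIFAR10":  datasets.CIFAR10(root, train=True, download=True, transform=to_tensor),
    "SVHN":     datasets.SVHN(root, split="train", download=True, transform=to_tensor),
    "CIFAR100": datasets.CIFAR100(root, train=True, download=True, transform=to_tensor),
}
```
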
Dataset Splits: Yes
LLM Response: "In co-teaching, we do not use validation dataset as in research [16]. However, to maintain consistency with CNLCU, we use 90% of the training data and 10% as the validation set in CNLCU."

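One possible reading of the 90%/10% split used for CNLCU is sketched below with PyTorch's `random_split`; the split mechanism and seed are assumptions, since the exact procedure is not quoted.

```python
# Assumed 90%/10% train/validation split for CNLCU; co-teaching uses the full training set.
import torch
from torch.utils.data import random_split

def split_train_val(train_set, val_fraction=0.1, seed=0):
    n_val = int(len(train_set) * val_fraction)
    n_train = len(train_set) - n_val
    generator = torch.Generator().manual_seed(seed)  # seed choice is an assumption
    return random_split(train_set, [n_train, n_val], generator=generator)
```
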
Hardware Specification: Yes
LLM Response: "In the experiment, we use four NVIDIA RTX 6000 GPUs with 24GB of memory each."

Software Dependencies: No
LLM Response: The paper mentions optimizers (Adam, SGD) and batch normalization, but does not specify version numbers for any software libraries, frameworks, or programming languages used in the experiments.

Experiment Setup: Yes
LLM Response: "For all datasets, we utilize a 9-layer CNN architecture [16] with dropout and batch normalization for the classification task. In co-teaching, for all datasets, we use the Adam optimizer with a momentum of 0.9, an initial learning rate of 0.001, and trained for 200 epochs. For R(T) = 1 − min{(T/T_k)·τ, τ}, where T_k is set to 10 by default [16]. In SAM-related optimization, such as SRIT and SRCNLCU, we use an SGD optimizer with an initial learning rate of 0.1, momentum of 0.9, weight decay of 0.0001, epochs of 200, and set ρ to 0.05 [13]. ... we empirically set the batch size to 128 as an optimal choice."

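The quoted schedule and SAM-related hyperparameters can be collected in a short sketch; `remember_rate` and `make_base_optimizer` are illustrative names, not functions from the paper.

```python
# Sketch of the remember-rate schedule R(T) = 1 - min{(T / T_k) * tau, tau}
# and the quoted SGD configuration for the SAM-based runs (SRIT, SRCNLCU).
import torch

def remember_rate(epoch: int, tau: float, t_k: int = 10) -> float:
    # Fraction of small-loss samples kept at epoch T; decreases linearly until it reaches 1 - tau.
    return 1.0 - min(epoch / t_k * tau, tau)

def make_base_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    # Base optimizer for the SAM-related runs; the SAM perturbation radius is rho = 0.05.
    return torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

rho, num_epochs, batch_size = 0.05, 200, 128
```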