Sharpness-Aware Minimization Activates the Interactive Teaching's Understanding and Optimization
Authors: Mingwei Xu, Xiaofeng Cao, Ivor Tsang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will conduct experiments on two core baselines, co-teaching [16] and CNLCU [40]. For all datasets, we utilize a 9-layer CNN architecture [16] with dropout and batch normalization for the classification task. In co-teaching, for all datasets, we use the Adam optimizer with a momentum of 0.9, an initial learning rate of 0.001, and train for 200 epochs. (A hedged configuration sketch follows the table.) |
| Researcher Affiliation | Academia | Mingwei Xu¹, Xiaofeng Cao¹, Ivor W. Tsang²,³. ¹School of Artificial Intelligence, Jilin University, China; ²CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore; ³College of Computing and Data Science, Nanyang Technological University, Singapore |
| Pseudocode | Yes | Algorithm 1: Sharpness Reduction Interactive Teaching (SRIT). Input: initial network parameters θ_f0, θ_g0, learning rate η, fixed parameter τ, iteration counts T_k and T_max, maximum iteration count N_max, pre-defined constant ρ. Output: updated network parameters θ_f and θ_g. (A hedged sketch of one SRIT step follows the table.) |
| Open Source Code | Yes | The authors state that the code will be made openly accessible so the experimental results can be reproduced. |
| Open Datasets | Yes | Based on previous research [16, 44, 40], we conduct experiments on five widely used datasets to effectively demonstrate the efficacy of the co-teaching algorithm. These datasets include MNIST [23], FMNIST [41], CIFAR10 [21], SVHN [33], and CIFAR100 [21]. |
| Dataset Splits | Yes | In co-teaching, we do not use a validation dataset, following [16]. However, to maintain consistency with CNLCU, we use 90% of the training data for training and hold out the remaining 10% as the validation set in CNLCU. (A hedged split sketch follows the table.) |
| Hardware Specification | Yes | In the experiment, we use four NVIDIA RTX 6000 GPUs with 24GB of memory each. |
| Software Dependencies | No | The paper mentions optimizers (Adam, SGD) and batch normalization, but does not specify version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | For all datasets, we utilize a 9-layer CNN architecture [16] with dropout and batch normalization for the classification task. In co-teaching, for all datasets, we use the Adam optimizer with a momentum of 0.9, an initial learning rate of 0.001, and train for 200 epochs. The small-loss keep rate follows R(T) = 1 − min(T·τ/T_k, τ), where T_k is set to 10 by default [16]. In SAM-related optimization, such as SRIT and SRCNLCU, we use an SGD optimizer with an initial learning rate of 0.1, momentum of 0.9, weight decay of 0.0001, 200 epochs, and ρ set to 0.05 [13]. ... we empirically set the batch size to 128 as an optimal choice. (Hedged sketches of the schedule and the SAM step follow the table.) |
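
The baseline hyperparameters quoted in the Research Type and Experiment Setup rows translate into a short configuration sketch. This is a minimal illustration assuming a PyTorch implementation (the paper does not name its framework); the placeholder `model` stands in for the 9-layer CNN of [16], and reading "momentum of 0.9" as Adam's beta1 is an assumption.

```python
import torch
import torch.nn as nn

# Placeholder for the 9-layer CNN of [16] with dropout and batch normalization;
# the real architecture is not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Co-teaching baseline: Adam, initial learning rate 0.001, trained for 200 epochs.
# Interpreting "momentum of 0.9" as Adam's beta1 = 0.9 is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
num_epochs = 200
```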
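The Experiment Setup row also quotes the keep-rate schedule R(T) = 1 − min(T·τ/T_k, τ) and the SGD settings used as the base optimizer for the SAM variants. The sketch below restates both in code under the same PyTorch assumption; the helper names `keep_rate` and `make_base_optimizer` are illustrative, not from the paper.

```python
import torch

def keep_rate(epoch, tau, T_k=10):
    """R(T) = 1 - min(T * tau / T_k, tau): the fraction of small-loss samples
    kept at epoch T. It ramps down linearly over the first T_k epochs and then
    stays at 1 - tau."""
    return 1.0 - min(epoch * tau / T_k, tau)

def make_base_optimizer(model):
    """Base optimizer for the SAM variants (SRIT, SRCNLCU): SGD with an initial
    learning rate of 0.1, momentum 0.9, and weight decay 1e-4."""
    return torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                           weight_decay=1e-4)

rho = 0.05        # SAM neighbourhood radius reported in the paper
batch_size = 128  # reported as the empirically chosen batch size
```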
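The Algorithm 1 header in the Pseudocode row describes Sharpness Reduction Interactive Teaching: two peer networks exchange small-loss samples in the co-teaching style, and each peer is updated with a sharpness-aware (SAM) step. The sketch below is a hedged reconstruction of one such step assembled from co-teaching [16] and SAM [13]; the helpers `sam_update` and `srit_step` are assumed names, and the exact selection and update order may differ from the authors' implementation. The keep fraction can be supplied by `keep_rate` from the previous sketch.

```python
import torch
import torch.nn.functional as F

def sam_update(model, optimizer, x, y, rho=0.05):
    """One SAM update: ascend to a worst-case weight perturbation of radius rho,
    then take the optimizer step using the gradient at the perturbed weights."""
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():
        params = [p for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                            # climb towards the sharper point
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()      # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                            # restore the original weights
    optimizer.step()

def srit_step(f, g, opt_f, opt_g, x, y, keep_frac, rho=0.05):
    """One interactive-teaching step: each network selects its small-loss
    (likely clean) samples, and its peer is SAM-updated on that subset."""
    with torch.no_grad():
        loss_f = F.cross_entropy(f(x), y, reduction="none")
        loss_g = F.cross_entropy(g(x), y, reduction="none")
    n_keep = max(1, int(keep_frac * x.size(0)))
    idx_f = torch.argsort(loss_f)[:n_keep]       # f's selection teaches g
    idx_g = torch.argsort(loss_g)[:n_keep]       # g's selection teaches f
    sam_update(f, opt_f, x[idx_g], y[idx_g], rho)
    sam_update(g, opt_g, x[idx_f], y[idx_f], rho)
```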
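Finally, the Dataset Splits row reports a 90%/10% train/validation split for the CNLCU experiments. A minimal way to reproduce such a split, again assuming PyTorch/torchvision and an arbitrary seed (the paper does not state one):

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = len(full_train) // 10                    # 10% held out for validation
train_set, val_set = random_split(
    full_train,
    [len(full_train) - n_val, n_val],
    generator=torch.Generator().manual_seed(0),  # the seed is an assumption
)
```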