Are Large Kernels Better Teachers than Transformers for ConvNets?

Authors: Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings are backed up by extensive experiments on both logit-level and feature-level KD out of the box, with no dedicated architectural nor training recipe modifications. Notably, we obtain the best-ever pure ConvNet under 30M parameters with 83.1% top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2.
Researcher Affiliation | Collaboration | (1) Department of Mathematics and Computer Science, Eindhoven University of Technology; (2) Department of Electrical and Computer Engineering, University of Texas at Austin; (3) JD Explore Academy; (4) Department of Computer Science, University of Liverpool.
Pseudocode | No | The paper provides mathematical formulations for its distillation methods (LKD, NKD, FD) but does not include explicit pseudocode blocks or algorithm listings. (A generic logit-level KD loss is sketched after the table for orientation.)
Open Source Code | Yes | Code is available at: https://github.com/VITA-Group/SLaK.
Open Datasets | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images.
Dataset Splits | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images.
Hardware Specification | Yes | We train all models with 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'RandAugment in Timm', but does not specify version numbers for these or other software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) and train models for 120 epochs (Section 4.1) and 300 epochs (Section 4.2) with a batch size of 4096 and a weight decay of 0.05. The learning rate is 4e-3 with a 20-epoch linear warmup followed by a cosine decaying schedule. (A minimal optimizer and schedule sketch also follows the table.)
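
As noted in the Pseudocode row, the paper expresses its distillation objectives (LKD, NKD, FD) mathematically rather than as pseudocode, and those exact formulations are not reproduced here. For orientation only, the sketch below shows a generic Hinton-style logit-level distillation loss; the `logit_kd_loss` name, temperature, and `alpha` weighting are illustrative assumptions, not the paper's method.

```python
# Generic logit-level knowledge distillation loss (KL divergence between
# temperature-softened teacher and student logits, mixed with cross-entropy
# on the hard labels). Illustrative only; NOT the paper's LKD/NKD/FD losses.
import torch.nn.functional as F


def logit_kd_loss(student_logits, teacher_logits, targets,
                  temperature=1.0, alpha=0.5):
    # Soft targets from the (frozen) teacher.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # temperature**2 keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Standard supervised cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce
```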
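
The Experiment Setup row fully specifies the optimizer and learning-rate schedule. Below is a minimal PyTorch sketch of that recipe (AdamW, peak LR 4e-3, weight decay 0.05, 20-epoch linear warmup, cosine decay); the `build_optimizer_and_scheduler` helper, the model argument, and the `steps_per_epoch` value are placeholders for illustration and are not taken from the released SLaK code.

```python
# Minimal sketch of the reported training setup: AdamW with weight decay 0.05,
# peak learning rate 4e-3, 20-epoch linear warmup, then cosine decay.
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model, epochs=120, steps_per_epoch=313,
                                  peak_lr=4e-3, warmup_epochs=20,
                                  weight_decay=0.05):
    # steps_per_epoch ~ 1,281,167 ImageNet-1K images / batch size 4096.
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warmup from 0 to the peak learning rate.
            return step / max(1, warmup_steps)
        # Cosine decay from the peak learning rate toward 0.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    # Scale the base LR by lr_lambda(step); call scheduler.step() once per
    # training iteration so warmup and decay are resolved at step granularity.
    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```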