Are Large Kernels Better Teachers than Transformers for ConvNets?
Authors: Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings are backed up by extensive experiments on both logit-level and feature-level KD out of the box, with no dedicated architectural nor training recipe modifications. Notably, we obtain the best-ever pure ConvNet under 30M parameters with 83.1% top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. |
| Researcher Affiliation | Collaboration | ¹Department of Mathematics and Computer Science, Eindhoven University of Technology; ²Department of Electrical and Computer Engineering, University of Texas at Austin; ³JD Explore Academy; ⁴Department of Computer Science, University of Liverpool. |
| Pseudocode | No | The paper provides mathematical formulations for its distillation methods (LKD, NKD, FD) but does not include explicit pseudocode blocks or algorithm listings (a generic logit-level KD sketch is given after this table for orientation). |
| Open Source Code | Yes | Code is available at: https://github.com/VITA-Group/SLaK. |
| Open Datasets | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images. |
| Dataset Splits | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images. |
| Hardware Specification | Yes | We train all models with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'RandAugment in timm', but does not specify version numbers for these or other software dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) and train models for 120 epochs (Section 4.1) and 300 epochs (Section 4.2) with a batch size of 4096 and a weight decay of 0.05. The learning rate is 4e-3 with a 20-epoch linear warmup followed by a cosine decaying schedule. (A minimal scheduler sketch of this recipe follows the table.) |
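
The Pseudocode row notes that the paper defines its distillation objectives only as equations. For orientation, here is a minimal PyTorch sketch of a standard temperature-scaled logit-level KD loss (Hinton-style cross-entropy plus KL divergence); the function name, `temperature`, and `alpha` weighting are illustrative assumptions, not the paper's exact LKD, NKD, or FD formulations.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits: torch.Tensor,
                  teacher_logits: torch.Tensor,
                  labels: torch.Tensor,
                  temperature: float = 1.0,
                  alpha: float = 0.5) -> torch.Tensor:
    """Generic logit-level KD: hard-label cross-entropy plus
    temperature-scaled KL divergence to the teacher's soft targets.
    Illustrative only; not the paper's exact LKD/NKD/FD losses."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # standard T^2 scaling of the soft-target term
    return (1.0 - alpha) * ce + alpha * kl
```

In practice the teacher (a large-kernel ConvNet or a Transformer, per the paper's study) would be frozen and run in `torch.no_grad()` to produce `teacher_logits`.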
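
The Experiment Setup row quotes the full optimization recipe (AdamW, weight decay 0.05, base learning rate 4e-3, 20-epoch linear warmup, cosine decay). Below is a minimal sketch of how such a schedule could be assembled with stock PyTorch schedulers; the placeholder model, the 120-epoch setting, and per-epoch stepping are assumptions, not the authors' released training code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

EPOCHS, WARMUP_EPOCHS = 120, 20       # Section 4.1; Section 4.2 uses 300 epochs
model = torch.nn.Linear(768, 1000)    # placeholder for the student ConvNet

optimizer = AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)

# 20-epoch linear warmup followed by cosine decay, stepped once per epoch.
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS)
cosine = CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[WARMUP_EPOCHS])

for epoch in range(EPOCHS):
    # ... one pass over ImageNet-1K with an effective batch size of 4096 ...
    scheduler.step()
```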