Are Large Kernels Better Teachers than Transformers for ConvNets?
Authors: Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings are backed up by extensive experiments on both logit-level and feature-level KD out of the box, with no dedicated architectural nor training recipe modifications. Notably, we obtain the best-ever pure ConvNet under 30M parameters with 83.1% top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. |
| Researcher Affiliation | Collaboration | ¹Department of Mathematics and Computer Science, Eindhoven University of Technology; ²Department of Electrical and Computer Engineering, University of Texas at Austin; ³JD Explore Academy; ⁴Department of Computer Science, University of Liverpool. |
| Pseudocode | No | The paper provides mathematical formulations for its distillation methods (LKD, NKD, FD) but does not include explicit pseudocode blocks or algorithm listings (a generic logit-level KD sketch is given after this table for orientation). |
| Open Source Code | Yes | Code is available at: https://github.com/VITA-Group/SLaK. |
| Open Datasets | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images. |
| Dataset Splits | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images. |
| Hardware Specification | Yes | We train all models with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'RandAugment in timm', but does not specify version numbers for these or other software dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) and train models for 120 epochs (Section 4.1) and 300 epochs (Section 4.2) with a batch size of 4096 and a weight decay of 0.05. The learning rate is 4e-3 with a 20-epoch linear warmup followed by a cosine decaying schedule. (A minimal scheduler sketch of this recipe follows the table.) |
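
The Pseudocode row notes that the paper defines its distillation objectives only as equations. For orientation, here is a minimal PyTorch sketch of a standard temperature-scaled logit-level KD loss (Hinton-style cross-entropy plus KL divergence); the function name, `temperature`, and `alpha` weighting are illustrative assumptions, not the paper's exact LKD, NKD, or FD formulations.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits: torch.Tensor,
                  teacher_logits: torch.Tensor,
                  labels: torch.Tensor,
                  temperature: float = 1.0,
                  alpha: float = 0.5) -> torch.Tensor:
    """Generic logit-level KD: hard-label cross-entropy plus
    temperature-scaled KL divergence to the teacher's soft targets.
    Illustrative only; not the paper's exact LKD/NKD/FD losses."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # standard T^2 scaling of the soft-target term
    return (1.0 - alpha) * ce + alpha * kl
```

In practice the teacher (a large-kernel ConvNet or a Transformer, per the paper's study) would be frozen and run in `torch.no_grad()` to produce `teacher_logits`.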
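
The Experiment Setup row quotes the full optimization recipe (AdamW, weight decay 0.05, base learning rate 4e-3, 20-epoch linear warmup, cosine decay). Below is a minimal sketch of how such a schedule could be assembled with stock PyTorch schedulers; the placeholder model, the 120-epoch setting, and per-epoch stepping are assumptions, not the authors' released training code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

EPOCHS, WARMUP_EPOCHS = 120, 20       # Section 4.1; Section 4.2 uses 300 epochs
model = torch.nn.Linear(768, 1000)    # placeholder for the student ConvNet

optimizer = AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)

# 20-epoch linear warmup followed by cosine decay, stepped once per epoch.
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS)
cosine = CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[WARMUP_EPOCHS])

for epoch in range(EPOCHS):
    # ... one pass over ImageNet-1K with an effective batch size of 4096 ...
    scheduler.step()
```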