Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Are Large Kernels Better Teachers than Transformers for ConvNets?
Authors: Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings are backed up by extensive experiments on both logit-level and feature-level KD out of the box, with no dedicated architectural nor training recipe modifications. Notably, we obtain the best-ever pure ConvNet under 30M parameters with 83.1% top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. |
| Researcher Affiliation | Collaboration | 1Department of Mathematics and Computer Science, Eindhoven University of Technology 2Department of Electrical and Computer Engineering, University of Texas at Austin 3JD Explore Academy 4Department of Computer Science, University of Liverpool. |
| Pseudocode | No | The paper provides mathematical formulations for distillation methods (LKD, NKD, FD) but does not include explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code is available at: https://github.com/VITA-Group/SLaK. |
| Open Datasets | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images. |
| Dataset Splits | Yes | We conduct experiments on the commonly used ImageNet-1K dataset (Russakovsky et al., 2015) containing 1k classes, 1,281,167 training images, and 50,000 validation images. |
| Hardware Specification | Yes | We train all models with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'RandAugment in Timm', but does not specify version numbers for these or other software dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We use AdamW optimizer (Loshchilov & Hutter, 2019) and train models for 120 epochs (Section 4.1) and 300 epochs (Section 4.2) with a batch size of 4096, and a weight decay of 0.05. The learning rate is 4e-3 with a 20-epoch linear warmup followed by a cosine decaying schedule. |
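For reproducers, the learning-rate schedule described in the setup row (base LR 4e-3, 20-epoch linear warmup, then cosine decay) can be sketched as a per-epoch function. This is a minimal illustration, not the authors' code: the function name `learning_rate` and the assumption that the cosine decays to zero (the excerpt states no minimum LR) are ours.

```python
import math

def learning_rate(epoch, total_epochs=120, base_lr=4e-3, warmup_epochs=20):
    """LR for a given epoch: linear warmup for 20 epochs, then cosine decay.

    Assumes the schedule decays to 0 by the final epoch; the paper excerpt
    does not state a floor LR, so that is an assumption of this sketch.
    """
    if epoch < warmup_epochs:
        # Linear warmup: ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

For the 300-epoch runs of Section 4.2, the same function applies with `total_epochs=300`; the warmup length and base LR are held fixed per the quoted setup.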