Initializing Models with Larger Ones

Authors: Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply weight selection to train small models on image classification datasets of different scales. We observe significant improvement in accuracy across datasets and models compared with baselines. Weight selection also substantially reduces the training time required to reach the same level of accuracy. Additionally, it can work alongside another popular method for knowledge transfer from large models, knowledge distillation (Hinton et al., 2015). We believe weight selection can be a general technique for training small models. Our work also encourages further research on utilizing pretrained models for efficient deployment. [...] 4 EXPERIMENTS
Researcher Affiliation | Collaboration | Zhiqiu Xu (1), Yanjie Chen (2), Kirill Vishniakov (3), Yida Yin (2), Zhiqiang Shen (3), Trevor Darrell (2), Lingjie Liu (1), Zhuang Liu (4); (1) University of Pennsylvania, (2) UC Berkeley, (3) MBZUAI, (4) Meta AI Research
Pseudocode | Yes | Algorithm 1: Uniform element selection from the teacher's weight tensor. (A hedged sketch of this selection scheme appears after the table.)
Open Source Code | Yes | Code is available at https://github.com/OscarXZQ/weight-selection.
Open Datasets | Yes | Datasets. We evaluate weight selection on 9 image classification datasets including ImageNet-1K (Deng et al., 2009), CIFAR-10, CIFAR-100 (Krizhevsky, 2009), Flowers (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), STL-10 (Coates et al., 2011), Food-101 (Bossard et al., 2014), DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011) and EuroSAT (Helber et al., 2019; 2018).
Dataset Splits | No | No explicit statement of validation dataset splits (e.g., percentages, counts, or specific predefined splits) is provided; only references to total training images and test accuracy.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or cloud computing instance types) are provided for the experimental setup.
Software Dependencies | No | The paper mentions software components such as PyTorch and the timm library, and various regularization methods (RandAugment, mixup, CutMix, random erasing, label smoothing, layer scale, head init scale), but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We follow the training recipe from ConvNeXt (Liu et al., 2022) with adjustments to batch size, learning rate, and stochastic depth rate (Huang et al., 2016) for different datasets. See Appendix A for details. [...] Table 11: Our basic recipe. [...] Table 12: Hyper-parameter setting on ConvNeXt-F. [...] Table 13: Hyper-parameter setting on ViT-T. (An illustrative recipe sketch also follows the table.)
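
The Pseudocode row above refers to Algorithm 1, uniform element selection from the teacher's weight tensor. Below is a minimal sketch of that selection scheme, assuming the student and teacher tensors share the same number of dimensions and every student dimension is no larger than the teacher's; the function name and the NumPy-based implementation are illustrative and not taken from the official repository.

```python
import numpy as np

def uniform_select(teacher_w: np.ndarray, student_shape: tuple) -> np.ndarray:
    """Initialize a student weight tensor by picking evenly spaced elements
    along every dimension of the (larger) teacher weight tensor."""
    assert teacher_w.ndim == len(student_shape), "tensors must share dimensionality"
    w = teacher_w
    for dim, s_size in enumerate(student_shape):
        t_size = w.shape[dim]
        # Evenly spaced integer indices covering the full teacher dimension.
        idx = np.linspace(0, t_size - 1, num=s_size).round().astype(int)
        w = np.take(w, idx, axis=dim)
    return w

# Example: a 192x192 student linear layer initialized from a 384x384 teacher layer.
teacher = np.random.randn(384, 384)
student_init = uniform_select(teacher, (192, 192))
print(student_init.shape)  # (192, 192)
```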
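
The Experiment Setup row cites the ConvNeXt training recipe with per-dataset adjustments to batch size, learning rate, and stochastic depth rate. The snippet below is a minimal sketch of how such a recipe is typically wired up with timm and PyTorch; the model name and all hyper-parameter values are illustrative placeholders, not the settings reported in Appendix A or Tables 11-13.

```python
import timm
import torch

# Stochastic depth (drop_path_rate), batch size, and learning rate are the
# knobs adjusted per dataset; the values here are placeholders only.
model = timm.create_model("convnext_femto", pretrained=False,
                          num_classes=100, drop_path_rate=0.1)

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
batch_size = 1024  # adjusted per dataset in the original recipe
```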