Initializing Models with Larger Ones
Authors: Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply weight selection to train small models on image classification datasets of different scales. We observe significant improvement in accuracy across datasets and models compared with baselines. Weight selection also substantially reduces the training time required to reach the same level of accuracy. Additionally, it can work alongside another popular method for knowledge transfer from large models, knowledge distillation (Hinton et al., 2015). We believe weight selection can be a general technique for training small models. Our work also encourages further research on utilizing pretrained models for efficient deployment. [...] 4 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Zhiqiu Xu¹, Yanjie Chen², Kirill Vishniakov³, Yida Yin², Zhiqiang Shen³, Trevor Darrell², Lingjie Liu¹, Zhuang Liu⁴ — ¹University of Pennsylvania, ²UC Berkeley, ³MBZUAI, ⁴Meta AI Research |
| Pseudocode | Yes | Algorithm 1: Uniform element selection from teacher's weight tensor (a sketch of this selection appears after the table) |
| Open Source Code | Yes | Code is available at https://github.com/OscarXZQ/weight-selection. |
| Open Datasets | Yes | Datasets. We evaluate weight selection on 9 image classification datasets including ImageNet-1K (Deng et al., 2009), CIFAR-10, CIFAR-100 (Krizhevsky, 2009), Flowers (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), STL-10 (Coates et al., 2011), Food-101 (Bossard et al., 2014), DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011) and EuroSAT (Helber et al., 2019; 2018). |
| Dataset Splits | No | No explicit statement of validation dataset splits (e.g., percentages, counts, or specific predefined splits) is provided, only references to total training images and test accuracy. |
| Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or cloud computing instance types) are provided for the experimental setup. |
| Software Dependencies | No | The paper mentions software components such as PyTorch and the timm library, as well as various regularization methods (RandAugment, Mixup, CutMix, Random Erasing, label smoothing, layer scale, head init scale), but does not provide version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We follow the training recipe from ConvNeXt (Liu et al., 2022) with adjustments to batch size, learning rate, and stochastic depth rate (Huang et al., 2016) for different datasets. See Appendix A for details. [...] Table 11: Our basic recipe. [...] Table 12: Hyper-parameter setting on ConvNeXt-F. [...] Table 13: Hyper-parameter setting on ViT-T. |
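
The Pseudocode row above refers to Algorithm 1, which selects a student-sized sub-tensor from each teacher weight tensor by taking uniformly spaced elements along every dimension. The PyTorch sketch below is a minimal illustration of that idea, assuming evenly spaced (rounded) indices per dimension; the function name `uniform_select` and the example layer sizes are hypothetical and not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch of uniform element selection (Algorithm 1 in the paper).
# Assumption: "uniform" means evenly spaced indices along each dimension.
import torch


def uniform_select(teacher_weight: torch.Tensor, student_shape: tuple) -> torch.Tensor:
    """Select a student-sized sub-tensor from a larger teacher weight tensor
    by taking uniformly spaced indices along every dimension."""
    selected = teacher_weight
    for dim, student_size in enumerate(student_shape):
        teacher_size = selected.shape[dim]
        assert student_size <= teacher_size, "student dim must not exceed teacher dim"
        # Evenly spaced indices spanning the teacher dimension.
        indices = torch.linspace(0, teacher_size - 1, student_size).round().long()
        selected = selected.index_select(dim, indices)
    return selected


# Illustrative usage: initialize a small linear layer from a larger pretrained one
# (widths loosely match ViT-S vs. ViT-T; sizes are for illustration only).
teacher_linear = torch.nn.Linear(384, 384)
student_linear = torch.nn.Linear(192, 192)
with torch.no_grad():
    student_linear.weight.copy_(uniform_select(teacher_linear.weight, student_linear.weight.shape))
    student_linear.bias.copy_(uniform_select(teacher_linear.bias, student_linear.bias.shape))
```

In the paper's setting, this selection is applied layer by layer to a pretrained larger model (the teacher) to initialize the smaller model (the student), which is then trained normally, optionally together with knowledge distillation.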