Transferring Knowledge From Large Foundation Models to Small Downstream Models

Authors: Shikai Qiu, Boran Han, Danielle C. Maddix, Shuai Zhang, Bernie Wang, Andrew Gordon Wilson

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost."
Researcher Affiliation | Collaboration | "(1) AWS AI Labs, Santa Clara, CA, USA; (2) Department of Computer Science, New York University, NYC, USA"
Pseudocode | Yes | "Algorithm 1 Adaptive Feature Transfer (AFT)" (an illustrative sketch of the idea follows the table)
Open Source Code | Yes | "Our code is available at https://github.com/amazon-science/adaptive-feature-transfer."
Open Datasets | Yes | "on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Parkhi et al., 2012), Describable Textures Dataset (DTD) (Cimpoi et al., 2014) and Food-101 (Bossard et al., 2014) datasets."
Dataset Splits | Yes | "We tune the hyperparameter β for AFT, KD, and B-Tuning in all experiments by holding out 10% of the original training set and selecting the β value that yields the highest accuracy on this holdout set." (selection loop sketched after the table)
Hardware Specification | Yes | "Table 1 compares the runtime on an NVIDIA A100 GPU for training ViT-S/16 (22M parameters) for one epoch on CIFAR-100..."
Software Dependencies | No | The paper mentions using 'timm' and the Hugging Face implementations for models but does not provide specific version numbers for these or other software libraries or dependencies.
Experiment Setup | Yes | "We use the Adam optimizer in all experiments and train for 5000 steps (rounded up to whole epochs) with a batch size of 128 and a cosine lr decay schedule. We use a base learning rate of 1e-4 for ViT-S/16 and MLP-Mixer-B, and 1e-3 for ResNet-50." (configuration sketched after the table)
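
The Pseudocode row refers to Algorithm 1 (AFT). As a rough, non-authoritative illustration of the general idea rather than the paper's exact objective, the sketch below adds a feature-transfer regularizer that encourages the small downstream model's features to be predictable from the frozen foundation model's features through a learned linear map, scaled by β; `FeatureTransferLoss` and `training_step` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransferLoss(nn.Module):
    """Illustrative feature-transfer regularizer (not the paper's exact Algorithm 1):
    penalize the part of the student's features that is not predicted by a learned
    linear map of the frozen pretrained (teacher) features."""

    def __init__(self, teacher_dim: int, student_dim: int, beta: float):
        super().__init__()
        # the map is learned jointly with the downstream model
        self.proj = nn.Linear(teacher_dim, student_dim, bias=False)
        self.beta = beta

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # teacher features come from the frozen foundation model; no gradient flows into it
        pred = self.proj(teacher_feats.detach())
        return self.beta * F.mse_loss(student_feats, pred)


def training_step(student, head, transfer_loss, x, y, teacher_feats):
    """Hypothetical training step: standard task loss plus the transfer regularizer."""
    feats = student(x)                      # features of the small downstream model
    loss = F.cross_entropy(head(feats), y)  # task loss
    return loss + transfer_loss(feats, teacher_feats)
```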
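
The Dataset Splits row describes selecting β on a 10% holdout of the original training set. A minimal sketch of that selection loop, assuming a hypothetical `train_and_evaluate(train_set, train_idx, holdout_idx, beta)` helper that trains on `train_idx` and returns accuracy on `holdout_idx`:

```python
import random

def select_beta(train_set, candidate_betas, train_and_evaluate, holdout_frac=0.1, seed=0):
    """Pick the beta with the highest accuracy on a held-out 10% of the training set."""
    indices = list(range(len(train_set)))
    random.Random(seed).shuffle(indices)
    n_holdout = int(holdout_frac * len(indices))
    holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]

    best_beta, best_acc = None, float("-inf")
    for beta in candidate_betas:
        acc = train_and_evaluate(train_set, train_idx, holdout_idx, beta)  # hypothetical helper
        if acc > best_acc:
            best_beta, best_acc = beta, acc
    return best_beta
```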
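
The Experiment Setup row quotes the optimizer and schedule. A minimal PyTorch sketch of that configuration, assuming the model and training set already exist (Adam, cosine learning-rate decay over 5000 steps, batch size 128; base learning rate 1e-4 for ViT-S/16 and MLP-Mixer-B, 1e-3 for ResNet-50); `build_training_setup` is a hypothetical name:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader

def build_training_setup(model, train_set, base_lr=1e-4, total_steps=5000, batch_size=128):
    # base_lr: 1e-4 for ViT-S/16 and MLP-Mixer-B, 1e-3 for ResNet-50 (per the quoted setup)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # cosine lr decay over training
    return loader, optimizer, scheduler
```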