Transferring Knowledge From Large Foundation Models to Small Downstream Models
Authors: Shikai Qiu, Boran Han, Danielle C. Maddix, Shuai Zhang, Bernie Wang, Andrew Gordon Wilson
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost. |
| Researcher Affiliation | Collaboration | (1) AWS AI Labs, Santa Clara, CA, USA; (2) Department of Computer Science, New York University, NYC, USA |
| Pseudocode | Yes | Algorithm 1 Adaptive Feature Transfer (AFT) |
| Open Source Code | Yes | Our code is available at https://github.com/amazon-science/adaptive-feature-transfer. |
| Open Datasets | Yes | on CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pets (Parkhi et al., 2012), Describable Textures Dataset (DTD) (Cimpoi et al., 2014) and Food-101 (Bossard et al., 2014) datasets. |
| Dataset Splits | Yes | We tune the hyperparameter β for AFT, KD, and B-Tuning in all experiments by holding out 10% of the original training set and selecting the β value that yields the highest accuracy on this holdout set. (See the β-selection sketch below the table.) |
| Hardware Specification | Yes | Table 1 compares the runtime on an NVIDIA A100 GPU for training ViT-S/16 (22M parameters) for one epoch on CIFAR-100... |
| Software Dependencies | No | The paper mentions using 'timm' and 'Hugging Face implementation' for models but does not provide specific version numbers for these or other software libraries or dependencies. |
| Experiment Setup | Yes | We use the Adam optimizer in all experiments and train for 5000 steps (rounded up to whole epochs) with a batch size of 128 and a cosine learning rate decay schedule. We use a base learning rate of 1e-4 for ViT-S/16 and MLP-Mixer-B, and 1e-3 for ResNet-50. (See the training-setup sketch below the table.) |
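The Dataset Splits row describes how β is tuned: 10% of the original training set is held out, and the β value giving the highest accuracy on that holdout is kept. Below is a minimal sketch of that selection loop; the `train_and_evaluate` callable, the candidate β grid, and the random seed are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of the beta-selection protocol quoted in the Dataset Splits row:
# hold out 10% of the original training set, train once per candidate beta,
# and keep the value with the highest holdout accuracy.
import torch
from torch.utils.data import random_split


def select_beta(train_set, candidate_betas, train_and_evaluate, seed=0):
    """Return the beta with the best accuracy on a 10% holdout split.

    `train_and_evaluate(sub_train, holdout, beta)` is an assumed helper that
    trains a model with the given beta and returns holdout accuracy.
    """
    n_holdout = int(0.1 * len(train_set))
    n_train = len(train_set) - n_holdout
    generator = torch.Generator().manual_seed(seed)
    sub_train, holdout = random_split(train_set, [n_train, n_holdout], generator=generator)

    best_beta, best_acc = None, float("-inf")
    for beta in candidate_betas:
        acc = train_and_evaluate(sub_train, holdout, beta)
        if acc > best_acc:
            best_beta, best_acc = beta, acc
    return best_beta
```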
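The Experiment Setup row fixes the optimizer, step budget, batch size, and learning-rate schedule. The sketch below assembles that configuration in PyTorch under the stated values (Adam, 5000 steps rounded up to whole epochs, batch size 128, cosine decay, base learning rate 1e-4 for ViT-S/16 and MLP-Mixer-B, 1e-3 for ResNet-50); the helper name `make_training_setup` and any Adam hyperparameters beyond the base learning rate are assumptions.

```python
# Sketch of the optimization setup quoted in the Experiment Setup row.
import math
import torch
from torch.utils.data import DataLoader


def make_training_setup(model, train_set, arch="vit_s16", batch_size=128, total_steps=5000):
    # Base learning rate: 1e-4 for ViT-S/16 and MLP-Mixer-B, 1e-3 for ResNet-50.
    base_lr = 1e-3 if arch == "resnet50" else 1e-4

    # Round the 5000-step budget up to whole epochs.
    steps_per_epoch = math.ceil(len(train_set) / batch_size)
    num_epochs = math.ceil(total_steps / steps_per_epoch)

    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    # Cosine learning-rate decay over the full (rounded-up) training run.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs * steps_per_epoch
    )
    return loader, optimizer, scheduler, num_epochs
```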