LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views

Authors: Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Broad experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization via emphasizing different views from fine-tuning data and pre-trained features. |
| Researcher Affiliation | Collaboration | 1 Google Inc, Mountain View, CA, USA; 2 Google DeepMind, Mountain View, CA, USA; 3 School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. |
| Pseudocode | No | The paper describes the framework in Section 4 and illustrates it with Figure 3, but it does not include explicit 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper states: 'We use TensorFlow (Abadi et al., 2015) with JAX (Bradbury et al., 2018) and Flax (Heek et al., 2023).' and provides links to these third-party libraries, but there is no explicit statement about releasing the code for the methodology described in this paper. |
| Open Datasets | Yes | For language-based recommendation, we use MovieLens (Harper & Konstan, 2015) and Amazon Review (Ni et al., 2019) ... For vision, we use Diabetic Retinopathy (Medical) (Emma Dugas, 2015) and ImageNet variants (Wang et al., 2019; Hendrycks et al., 2021a;b; Recht et al., 2019). All datasets are from TensorFlow Datasets (TFD). ... TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets. |
| Dataset Splits | Yes | Settings. We consider three data distributions, one each for pre-training, fine-tuning, and testing. Pre-training data and test data are not available, and we only have fine-tuning data and pre-trained model features. When we refer to training data, we mean the fine-tuning data. We consider the fine-tuning distribution as in-distribution (ID) and the test distribution as out-of-distribution (OOD). Thus, ID data represents the samples on which the model has been trained, and OOD data represents unfamiliar samples not seen during training. For each algorithm, we choose the best hyperparameters from the above candidate sets to achieve the best performance on the in-distribution validation set, while not accessing the out-of-distribution datasets. |
| Hardware Specification | Yes | In all experiments, we use Dragonfish TPU (i.e., TPUv3) and Jellyfish TPU (i.e., TPUv2) with 2x2 topology for T5X and ViT experiments, respectively. |
| Software Dependencies | No | The paper states 'we use TensorFlow (Abadi et al., 2015) with JAX (Bradbury et al., 2018) and Flax (Heek et al., 2023).' However, it does not provide specific version numbers for these software dependencies, only citations to their original papers. |
| Experiment Setup | Yes | Here are common hyperparameters and settings for all algorithms. We use the Adam optimizer and SGD optimizer for T5X and ViT experiments, respectively. For batch sizes, we use 200 for MovieLens, 100 for Amazon Review, and 512 for all computer vision datasets. For learning rates, we consider a set {0.0001, 0.001, 0.01, 0.1} for all algorithms except linear probing. In linear probing, we use a learning rate set with larger values {0.001, 0.01, 0.1, 1.0, 10.0}. |
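Since the Open Datasets row reports that all data come from TensorFlow Datasets, a minimal loading sketch is given below. The TFDS identifiers are assumptions for illustration only; the paper names the dataset families (MovieLens, Amazon Review, Diabetic Retinopathy, ImageNet variants) but not the exact TFDS configs.

```python
# Minimal sketch of loading fine-tuning data from TensorFlow Datasets.
# The TFDS names below are hypothetical stand-ins for the dataset families
# named in the paper, not identifiers the paper itself provides.
import tensorflow_datasets as tfds

DATASETS = {
    "movielens": "movielens/100k-ratings",          # assumed config
    "diabetic_retinopathy": "diabetic_retinopathy_detection",
    "imagenet_r": "imagenet_r",
}

def load_finetuning_data(name: str, split: str = "train"):
    """Load one split; only fine-tuning (ID) data is assumed to be available."""
    return tfds.load(DATASETS[name], split=split, shuffle_files=True)

# Example: inspect the feature keys of one MovieLens example.
ds = load_finetuning_data("movielens")
for example in ds.take(1):
    print(sorted(example.keys()))
```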
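The Hardware Specification and Software Dependencies rows report TPUv2/TPUv3 slices with a 2x2 topology and a TensorFlow + JAX + Flax stack without pinned versions. The sketch below is one way a reproducer could record the environment at run time; it is an illustration, not something the paper prescribes.

```python
# Record library versions and visible accelerators before a run, since the paper
# does not pin versions. On a 2x2 TPUv2/TPUv3 slice, JAX should report 8 cores
# (4 chips, 2 cores per chip).
import jax
import flax
import tensorflow as tf

def log_environment():
    print("jax:", jax.__version__)
    print("flax:", flax.__version__)
    print("tensorflow:", tf.__version__)
    print("device count:", jax.device_count())
    print("devices:", jax.devices())

log_environment()
```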
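The Experiment Setup row fixes the optimizer family (Adam for T5X, SGD for ViT), the batch sizes, and the learning-rate grids. Below is a minimal configuration sketch using optax; the paper reports JAX/Flax but does not name an optimizer library, so optax is an assumption used purely for illustration.

```python
# Sketch of the reported optimizer/batch-size settings, expressed with optax.
import optax

BATCH_SIZES = {
    "movielens": 200,
    "amazon_review": 100,
    "vision": 512,  # all computer-vision datasets
}

# Learning-rate grids reported in the Experiment Setup row.
FINETUNE_LRS = [1e-4, 1e-3, 1e-2, 1e-1]
LINEAR_PROBE_LRS = [1e-3, 1e-2, 1e-1, 1.0, 10.0]

def make_optimizer(experiment: str, learning_rate: float) -> optax.GradientTransformation:
    """Adam for the T5X (language) experiments, SGD for the ViT (vision) experiments."""
    if experiment == "t5x":
        return optax.adam(learning_rate)
    if experiment == "vit":
        return optax.sgd(learning_rate)
    raise ValueError(f"unknown experiment type: {experiment}")

# Example: build the optimizer for a ViT run with one grid point.
opt = make_optimizer("vit", learning_rate=1e-2)
```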
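The Dataset Splits row states that hyperparameters are chosen on the in-distribution validation set without accessing out-of-distribution data. The sketch below captures that selection loop; `train_and_evaluate` is a hypothetical, caller-supplied fine-tuning routine and is not part of the paper.

```python
# Sketch of ID-validation-only hyperparameter selection: pick the learning rate that
# maximizes in-distribution (ID) validation accuracy; OOD test sets are evaluated only
# afterwards, on the selected model.
from typing import Any, Callable, Iterable, Tuple

def select_on_id_validation(
    learning_rates: Iterable[float],
    train_and_evaluate: Callable[[float], Tuple[Any, float]],  # hypothetical helper
) -> Tuple[float, Any]:
    """Return (best_lr, best_model) chosen by ID validation accuracy only."""
    best_lr, best_model, best_acc = None, None, float("-inf")
    for lr in learning_rates:
        model, id_val_acc = train_and_evaluate(lr)  # fine-tune, score on ID validation
        if id_val_acc > best_acc:
            best_lr, best_model, best_acc = lr, model, id_val_acc
    # OOD evaluation happens only after selection, using the returned model.
    return best_lr, best_model
```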