Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe
Authors: Albert Q. Jiang, Alicja Ziarko, Bartosz Piotrowski, Wenda Li, Mateja Jamnik, Piotr Miłoś
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our innovation is an algorithm that produces optimal configurations of model sizes, data quantities, and fine-tuning methods for text-embedding models at different computational budget levels. The resulting recipe, which we obtain through extensive experiments, can be used by practitioners to make informed design choices for their embedding models. To answer this, we performed an extensive empirical study. (Section 4, Experiments:) We first specify the relevant details of our experimental setup (Section 4.1). Next, we present the results of our experiments, where we contrastively train a grid of models of different sizes, using different computational budgets, and apply different compute-optimal fine-tuning methods with varying hyperparameters (Section 4.2). (A sketch of a contrastive objective consistent with this setup follows the table.) |
| Researcher Affiliation | Collaboration | Albert Q. Jiang* (University of Cambridge); Alicja Ziarko* (IDEAS NCBR, University of Warsaw, IMPAN); Bartosz Piotrowski (IDEAS NCBR); Wenda Li (University of Edinburgh); Mateja Jamnik (University of Cambridge); Piotr Miłoś (IDEAS NCBR, University of Warsaw, IMPAN, deepsense.ai) |
| Pseudocode | Yes | Algorithm 1: Recipe for a compute-optimal embedding model. Input: compute budget C. Output: fine-tuning method, model size, data quantity, and (optionally) the method's hyperparameters. If C ≤ 9.06e16 FLOP: use full fine-tuning; go to Figure 6a to find the optimal model parameters N given budget C; calculate the data quantity D; return (Full fine-tuning, N, D, ()). Otherwise: use LoRA; go to Figure 6b to find the optimal model parameters N given budget C; calculate the data quantity D; go to Figure 5 to find the LoRA rank R according to C and N; return (LoRA, N, D, (R)). (A Python sketch of this recipe follows the table.) |
| Open Source Code | Yes | We open-source the code to train and evaluate our models at: https://github.com/SeqDM/Efficient-Embeddings. |
| Open Datasets | Yes | We fine-tune our models on the English partition of the BAAI BGE dataset [Xiao et al., 2023], which contains 200 million semantically related (query, value) pairs from various internet sources such as Wikipedia and Stack Exchange. We utilize the BAAI dataset available at https://data.baai.ac.cn/details/BAAI-MTP. |
| Dataset Splits | No | We split our dataset into a train and a test set, with the test set being the points corresponding to Pythia 2.8B, the biggest model that we consider, and the train set being the rest. |
| Hardware Specification | Yes | For all training runs, we used a cluster with A100 GPUs with 40GB or 80GB of VRAM. We used 4 CPUs and 128GB of RAM per GPU. |
| Software Dependencies | No | We use the AdamW optimiser [Loshchilov and Hutter, 2019] and a cosine learning rate scheduler during training... We use the Transformers [Wolf et al., 2020] library for the training of the models and use the Accelerate [Gugger et al., 2022] library to facilitate multi-GPU training. |
| Experiment Setup | Yes | batch_size = 1024, context_length = 75, AdamW.weight_decay = 0.1, tau = 0.025 (temperature parameter). We use the AdamW optimiser [Loshchilov and Hutter, 2019] and a cosine learning rate scheduler during training. The learning rate first goes through a linear warm-up phase of 1/10 of the total steps to a peak that is 1/10 of the pre-training peak learning rate, that is, to learning rates between 1.2e-5 and 6e-5. Then, it decays in a cosine schedule to 1/10 of the maximum at the end of training. (A sketch of this schedule follows the table.) |
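
The quoted Algorithm 1 reduces to a simple budget-threshold decision. The following is a minimal Python sketch of that recipe, assuming hypothetical lookup callables (`lookup_n_full`, `lookup_n_lora`, `lookup_lora_rank`, `flops_to_tokens`) that stand in for reading values off the paper's Figures 6a, 6b, and 5 and for converting a budget into a data quantity; none of these names come from the released code.

```python
# Minimal sketch of Algorithm 1 ("Recipe for a compute-optimal embedding model").
# The lookup_* callables are hypothetical placeholders for the paper's figures.

FULL_FT_BUDGET_THRESHOLD = 9.06e16  # FLOP threshold quoted in Algorithm 1


def compute_optimal_recipe(budget_flops,
                           lookup_n_full,     # Figure 6a: budget -> model size N
                           lookup_n_lora,     # Figure 6b: budget -> model size N
                           lookup_lora_rank,  # Figure 5: (budget, N) -> LoRA rank R
                           flops_to_tokens):  # (budget, N) -> data quantity D
    """Return (method, model_size, data_quantity, hyperparams) for a budget."""
    if budget_flops <= FULL_FT_BUDGET_THRESHOLD:
        # Lower budgets: full fine-tuning is compute-optimal.
        n = lookup_n_full(budget_flops)
        d = flops_to_tokens(budget_flops, n)
        return "full_fine_tuning", n, d, ()
    else:
        # Higher budgets: LoRA fine-tuning is compute-optimal.
        n = lookup_n_lora(budget_flops)
        d = flops_to_tokens(budget_flops, n)
        r = lookup_lora_rank(budget_flops, n)
        return "lora", n, d, (r,)
```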
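
The paper contrastively trains on semantically related (query, value) pairs with temperature tau = 0.025. The sketch below shows a standard InfoNCE objective with in-batch negatives that is consistent with that setup; it is an illustrative assumption, not a copy of the authors' training loss.

```python
import torch
import torch.nn.functional as F


def in_batch_infonce(query_emb, value_emb, tau=0.025):
    """InfoNCE loss with in-batch negatives.

    query_emb, value_emb: (batch, dim) embeddings of semantically related
    (query, value) pairs; row i of each tensor forms a positive pair.
    tau: temperature, set to 0.025 in the paper's experiment setup.
    """
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(value_emb, dim=-1)
    logits = q @ v.T / tau                          # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)          # positives on the diagonal


# Example: random embeddings for a batch of 4 pairs.
loss = in_batch_infonce(torch.randn(4, 32), torch.randn(4, 32))
```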
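
The learning-rate schedule in the Experiment Setup row (linear warm-up over the first 1/10 of steps to a peak of 1/10 of the pre-training peak, then cosine decay to 1/10 of that maximum) can be reproduced with a plain PyTorch LambdaLR. This is a reconstruction of the described shape, not the authors' exact scheduler, which is built on the Transformers/Accelerate stack.

```python
import math
import torch


def make_warmup_cosine_scheduler(optimizer, total_steps, warmup_frac=0.1,
                                 min_lr_frac=0.1):
    """Linear warmup for warmup_frac of steps, then cosine decay to
    min_lr_frac of the peak learning rate (both 1/10 in the paper)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                 # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
        return min_lr_frac + (1.0 - min_lr_frac) * cosine    # decay to 1/10 of peak

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)


# Example: peak LR is 1/10 of the pre-training peak, e.g. in the 1.2e-5 to 6e-5 range.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.2e-5, weight_decay=0.1)
scheduler = make_warmup_cosine_scheduler(optimizer, total_steps=1000)
```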