Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling empirically, while being conceptually simpler and more scalable." (The second sketch after this table illustrates the leverage-score baseline referenced here.)
Researcher Affiliation | Collaboration | "1. Google Research; 2. Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria; 3. CNRS & IRIF, Université Paris Cité, Paris, France; 4. Carnegie Mellon University."
Pseudocode | Yes | "Algorithm 1 Data-Selection(D, k, ε, Λ, C) ... Algorithm 2 Data-Selection-Regression(A, k, ε, Λ, C)" (See the first Python sketch following this table.)
Open Source Code | No | The paper does not provide an explicit statement or link indicating the availability of open-source code for the methodology described.
Open Datasets | Yes | "We present our results on the UCI gas sensor dataset from the University of California, Irvine repository (Vergara, 2012; Vergara et al., 2012; Rodriguez-Lujan et al., 2014) in Figure 2. ... We use the WMT T2T En-De translation task dataset (Bojar et al., 2014)... We use three classic datasets, MNIST (LeCun et al., 1998), FMNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009)."
Dataset Splits | Yes | "We use the WMT T2T En-De translation task dataset (Bojar et al., 2014), which consists of 4,592,289 training examples, a test set of size 3003, and a validation set of size 3000. ... Finally, we train a model on all k data points and evaluate it on a validation set."
Hardware Specification | Yes | "All algorithms were implemented in Python using the TensorFlow framework and the runtime calculation experiments ran on CPU, on a cloud VM with 24 CPUs and 100 GB of RAM."
Software Dependencies | Yes | "For the clustering required in Algorithm 1, we run k-means clustering using Python's sklearn implementation... the standard k-medoid implementation from Python's scikit-learn library is significantly faster... All algorithms were implemented in Python using the TensorFlow framework..."
Experiment Setup | Yes | "We used a batch size of 128, a constant learning rate of 0.001, and dropout of 0.1. ... We train each model for 10 epochs, a batch size of 32, and use the Adam optimizer with a learning rate of 10^-3." (A hedged training sketch follows as the final example below.)
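
For concreteness, here is a minimal Python sketch of the clustering-based sensitivity sampling idea behind the paper's Algorithm 1 (Data-Selection). It is an illustrative reconstruction, not the paper's exact procedure: the function name data_selection, the additive score combining distance-to-center with a per-example loss, and the importance-weight correction are assumptions made for this sketch, and the ε and Λ parameters of the real Algorithm 1 are omitted.

```python
# Illustrative sketch of clustering-based sensitivity sampling
# (an assumed simplification of the paper's Algorithm 1, Data-Selection).
import numpy as np
from sklearn.cluster import KMeans

def data_selection(X, losses, k, m, seed=0):
    """Pick m of n examples with probability proportional to a
    sensitivity-style score, then return importance weights.

    X      : (n, d) array of example embeddings
    losses : (n,) per-example losses under a preliminary model
    k      : number of k-means clusters
    m      : number of examples to select
    """
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Distance of each point to its assigned cluster center.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    # Assumed score: points far from their center or with a high loss
    # are treated as more informative and sampled more often.
    scores = dists + losses
    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=False, p=probs)
    # Reweight so the sampled subset approximates the full objective.
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```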
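The abstract's linear-regression claim is relative to leverage score sampling, the standard baseline. The sketch below shows that baseline only, not the paper's Algorithm 2 (Data-Selection-Regression): rows of the design matrix A are sampled proportionally to their leverage scores, i.e. the squared row norms of an orthonormal basis for A's column space.

```python
# Classical leverage score sampling for linear regression: the
# baseline the paper compares its clustering-based sampler against.
import numpy as np

def leverage_score_sample(A, m, seed=0):
    rng = np.random.default_rng(seed)
    # The leverage score of row i is the squared norm of the i-th row
    # of U, where A = U S V^T is a thin SVD.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    lev = (U ** 2).sum(axis=1)
    probs = lev / lev.sum()
    idx = rng.choice(A.shape[0], size=m, replace=False, p=probs)
    # Rescaling weights for an unbiased subsampled least-squares problem.
    return idx, 1.0 / (m * probs[idx])
```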
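Finally, the hyperparameters quoted in the Experiment Setup row translate into a short Keras training loop like the one below. Only the stated values (10 epochs, batch size 32, Adam at learning rate 10^-3) come from the paper; the model, loss, and data pipeline are placeholder assumptions.

```python
# Hedged sketch of the quoted fine-tuning setup; everything except
# the stated hyperparameters is a placeholder assumption.
import tensorflow as tf

def train_on_selection(model, x_sel, y_sel, x_val, y_val):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # 10^-3, per the paper
        loss="sparse_categorical_crossentropy",                  # assumed loss
        metrics=["accuracy"],
    )
    model.fit(x_sel, y_sel,
              epochs=10,        # per the paper
              batch_size=32,    # per the paper
              validation_data=(x_val, y_val))
    # Evaluate on the held-out validation set, as the paper describes.
    return model.evaluate(x_val, y_val)
```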