Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling empirically, while being conceptually simpler and more scalable." (The second sketch after this table illustrates the leverage-score baseline referenced here.)
Researcher Affiliation | Collaboration | "1. Google Research; 2. Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria; 3. CNRS & IRIF, Université Paris Cité, Paris, France; 4. Carnegie Mellon University."
Pseudocode | Yes | "Algorithm 1 Data-Selection(D, k, ε, Λ, C) ... Algorithm 2 Data-Selection-Regression(A, k, ε, Λ, C)" (See the first Python sketch following this table.)
Open Source Code | No | The paper does not provide an explicit statement or link indicating the availability of open-source code for the methodology described.
Open Datasets | Yes | "We present our results on the UCI gas sensor dataset from the University of California, Irvine repository (Vergara, 2012; Vergara et al., 2012; Rodriguez-Lujan et al., 2014) in Figure 2. ... We use the WMT T2T En-De translation task dataset (Bojar et al., 2014)... We use three classic datasets, MNIST (LeCun et al., 1998), FMNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009)."
Dataset Splits | Yes | "We use the WMT T2T En-De translation task dataset (Bojar et al., 2014), which consists of 4,592,289 training examples, a test set of size 3003, and a validation set of size 3000. ... Finally, we train a model on all k data points and evaluate it on a validation set."
Hardware Specification | Yes | "All algorithms were implemented in Python using the TensorFlow framework and the runtime calculation experiments ran on CPU, on a cloud VM with 24 CPUs and 100 GB of RAM."
Software Dependencies | Yes | "For the clustering required in Algorithm 1, we run k-means clustering using Python's sklearn implementation... the standard k-medoid implementation from Python's scikit-learn library is significantly faster... All algorithms were implemented in Python using the TensorFlow framework..."
Experiment Setup | Yes | "We used a batch size of 128, a constant learning rate of 0.001, and dropout of 0.1. ... We train each model for 10 epochs, a batch size of 32, and use the Adam optimizer with a learning rate of 10^-3." (A hedged training sketch follows as the final example below.)
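
For concreteness, here is a minimal Python sketch of the clustering-based sensitivity sampling idea behind the paper's Algorithm 1 (Data-Selection). It is an illustrative reconstruction, not the paper's exact procedure: the function name data_selection, the additive score combining distance-to-center with a per-example loss, and the importance-weight correction are assumptions made for this sketch, and the ε and Λ parameters of the real Algorithm 1 are omitted.

```python
# Illustrative sketch of clustering-based sensitivity sampling
# (an assumed simplification of the paper's Algorithm 1, Data-Selection).
import numpy as np
from sklearn.cluster import KMeans

def data_selection(X, losses, k, m, seed=0):
    """Pick m of n examples with probability proportional to a
    sensitivity-style score, then return importance weights.

    X      : (n, d) array of example embeddings
    losses : (n,) per-example losses under a preliminary model
    k      : number of k-means clusters
    m      : number of examples to select
    """
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Distance of each point to its assigned cluster center.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    # Assumed score: points far from their center or with a high loss
    # are treated as more informative and sampled more often.
    scores = dists + losses
    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=False, p=probs)
    # Reweight so the sampled subset approximates the full objective.
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```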
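The abstract's linear-regression claim is relative to leverage score sampling, the standard baseline. The sketch below shows that baseline only, not the paper's Algorithm 2 (Data-Selection-Regression): rows of the design matrix A are sampled proportionally to their leverage scores, i.e. the squared row norms of an orthonormal basis for A's column space.

```python
# Classical leverage score sampling for linear regression: the
# baseline the paper compares its clustering-based sampler against.
import numpy as np

def leverage_score_sample(A, m, seed=0):
    rng = np.random.default_rng(seed)
    # The leverage score of row i is the squared norm of the i-th row
    # of U, where A = U S V^T is a thin SVD.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    lev = (U ** 2).sum(axis=1)
    probs = lev / lev.sum()
    idx = rng.choice(A.shape[0], size=m, replace=False, p=probs)
    # Rescaling weights for an unbiased subsampled least-squares problem.
    return idx, 1.0 / (m * probs[idx])
```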
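Finally, the hyperparameters quoted in the Experiment Setup row translate into a short Keras training loop like the one below. Only the stated values (10 epochs, batch size 32, Adam at learning rate 10^-3) come from the paper; the model, loss, and data pipeline are placeholder assumptions.

```python
# Hedged sketch of the quoted fine-tuning setup; everything except
# the stated hyperparameters is a placeholder assumption.
import tensorflow as tf

def train_on_selection(model, x_sel, y_sel, x_val, y_val):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # 10^-3, per the paper
        loss="sparse_categorical_crossentropy",                  # assumed loss
        metrics=["accuracy"],
    )
    model.fit(x_sel, y_sel,
              epochs=10,        # per the paper
              batch_size=32,    # per the paper
              validation_data=(x_val, y_val))
    # Evaluate on the held-out validation set, as the paper describes.
    return model.evaluate(x_val, y_val)
```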