DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Authors: Yongchan Kwon, Eric Wu, Kevin Wu, James Zou

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods.
Researcher Affiliation | Academia | Columbia University, Stanford University
Pseudocode | Yes | We provide a pseudo algorithm in Appendix A.
Open Source Code | Yes | Python-based implementation codes are available at https://github.com/ykwon0407/DataInf.
Open Datasets | Yes | For all experiments, we consider publicly available and widely used large-scale LLMs and diffusion models. We use the RoBERTa model (Liu et al., 2019) for the approximation error analysis and mislabeled data detection tasks, and the Llama-2-13B-chat (Touvron et al., 2023) and the stable-diffusion-v1.5 (Rombach et al., 2022) models for the influential data identification task. We used the training and validation splits of the dataset available at Hugging Face Datasets library (Lhoest et al., 2021).
Dataset Splits | Yes | We used the training and validation splits of the dataset available at Hugging Face Datasets library (Lhoest et al., 2021). Only the training dataset is used to fine-tune the model, and we compute the influence of individual training data points on the validation loss. For GLUE-SST2 and GLUE-QQP, we randomly sample 4500 (resp. 500) samples from the original training (resp. validation) dataset.
Hardware Specification | Yes | The training was performed on a single machine with one NVIDIA A40 GPU using the Hugging Face Peft library (Mangrulkar et al., 2022). The training was performed on a single machine with 4 NVIDIA V100 GPUs using the Hugging Face Peft library (Mangrulkar et al., 2022).
Software Dependencies | No | The paper mentions using 'Hugging Face Transformers library' and 'Hugging Face Peft library' but does not specify their version numbers.
Experiment Setup | Yes | Across all fine-tuning runs, we use a learning rate of 3 × 10^-4 with a batch size of 32 across 10 training epochs. As for the LoRA hyperparameters, the dropout rate is set to be 0.05. We choose the rank of the LoRA matrix r from {1, 2, 4, 8} and α is always set to be r.
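
As a rough illustration of the Dataset Splits row, the sketch below subsamples GLUE-SST2 with the Hugging Face Datasets library, matching the quoted 4500 training / 500 validation counts. The shuffle seed is an assumption on our part; the paper does not report one.

```python
# Minimal sketch, assuming the Hugging Face Datasets library is installed.
# The seed value is illustrative, not taken from the paper.
from datasets import load_dataset

raw = load_dataset("glue", "sst2")

# Randomly sample 4500 training and 500 validation examples, as quoted above.
train_subset = raw["train"].shuffle(seed=42).select(range(4500))
val_subset = raw["validation"].shuffle(seed=42).select(range(500))

print(len(train_subset), len(val_subset))  # 4500 500
```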
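
The Experiment Setup row can likewise be read as a LoRA configuration for the Hugging Face PEFT and Transformers libraries. The sketch below is an assumption-laden illustration, not the authors' code: the base checkpoint (roberta-large) and the sequence-classification task type are our guesses, while the rank choice, alpha = r, dropout 0.05, learning rate 3e-4, batch size 32, and 10 epochs come from the quoted setup.

```python
# Minimal sketch of the quoted fine-tuning configuration, assuming the
# Hugging Face PEFT and Transformers libraries; model and task are guesses.
from transformers import AutoModelForSequenceClassification, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

rank = 4  # the paper sweeps r over {1, 2, 4, 8}

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=rank,
    lora_alpha=rank,    # alpha is always set equal to r
    lora_dropout=0.05,  # dropout rate from the quoted setup
)

base_model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2  # checkpoint choice is an assumption
)
model = get_peft_model(base_model, lora_config)

training_args = TrainingArguments(
    output_dir="lora-sst2",
    learning_rate=3e-4,              # learning rate from the quoted setup
    per_device_train_batch_size=32,  # batch size from the quoted setup
    num_train_epochs=10,             # training epochs from the quoted setup
)
```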