Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DELIFT: Data Efficient Language model Instruction Fine-Tuning
Authors: Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevsky
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across multiple datasets and model scales show DELIFT reduces fine-tuning data requirements by up to 70% without compromising performance, consistently outperforming existing methods by up to 26% in effectiveness and efficiency. |
| Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; IBM Research |
| Pseudocode | Yes | Algorithm 1 Greedy Maximization for Submodular Function |
| Open Source Code | Yes | Our complete code base is publicly available at https://github.com/agarwalishika/delift, enabling further exploration and replication. |
| Open Datasets | Yes | Datasets. We group the datasets by the primary goal of fine-tuning, ensuring a clear mapping from the data to the corresponding submodular objective. In particular: 1. Instruction Tuning: Mix-Instruct (Jiang et al., 2023), P3 (Sanh et al., 2021). Both aim to enhance general instruction-following behavior, featuring a variety of task prompts and user requests. 2. Task-Specific Fine Tuning: HotpotQA (Yang et al., 2018) aligned with MMLU (Hendrycks et al., 2021), Mix-Instruct aligned with MT-Bench (Zheng et al., 2023), and Mix-Instruct aligned with GSM-8k (Cobbe et al., 2021). These pairings allow us to extract only the most relevant samples from a large corpus to improve performance on a specific target benchmark. 3. Continual Fine-Tuning: (a) SQuAD (Rajpurkar et al., 2016) paired with HotpotQA to inject more complex, multi-hop reasoning data after simpler QA, and (b) a proprietary IBM/Government domain query rewriting dataset. |
| Dataset Splits | Yes | In all cases, we fixed an approximate budget of 30% for subset selection unless otherwise noted, striking a balance between data efficiency and coverage. Beyond consistently using 30% of the data in our main experiments, we investigated how varying the subset size influences performance. We tested budgets ranging from as little as 5% up to 50% of the original training set (in increments of 10%). |
| Hardware Specification | No | A part of this work used the Delta system at the National Center for Supercomputing Applications through allocation CIS240550 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. |
| Software Dependencies | No | The paper mentions LLMs like Llama-3.2-3B, Mistral-7B-v0.1, Qwen2-72B-Instruct, Phi-3-mini-128k-instruct, and fine-tuning methods like ICL and QLoRA, but does not provide specific version numbers for any software libraries or packages used for implementation. |
| Experiment Setup | Yes | Consistent hyperparameter settings were maintained across all experiments to ensure reproducibility: Submodular Function: Utilized Facility Location (FL), Facility Location Mutual Information (FLMI), or Facility Location Conditional Gain (FLCG) based on the use case. Utility Metric Scaling Factor: Set η = 1 for FLMI and ν = 1 for FLCG. Budget (% of Data): Fixed at 30% for all subset selection experiments. Optimization Algorithm: Employed greedy maximization with a stopping criterion based on the budget. Distance Metric: Used length-normalized L2 norm. Teacher Forcing Technique: Applied during utility metric computation to ensure reliable prediction accuracy measurement. |
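
The table refers to greedy maximization of a submodular function (Algorithm 1) with a budget-based stopping criterion. A minimal sketch of that general technique for the plain Facility Location objective, f(S) = Σ_i max_{j∈S} sim[i][j], might look like the following. This is not the authors' implementation: the function name `facility_location_greedy`, the toy similarity matrix, and the assumption of non-negative similarities are illustrative only, and the paper's utility-based metric and the FLMI/FLCG variants are not modeled here.

```python
def facility_location_greedy(sim, k):
    """Greedily select k indices that approximately maximize
    f(S) = sum_i max_{j in S} sim[i][j].

    Assumes sim is a square matrix (list of lists) of non-negative
    similarities; this is an illustrative assumption, not the paper's metric.
    """
    n = len(sim)
    selected = []
    # best_cover[i] = highest similarity between point i and the selected set
    best_cover = [0.0] * n
    for _ in range(min(k, n)):  # stopping criterion: the budget k
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding candidate j to the current subset
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical toy data: points 0-2 form one cluster, points 3-4 another.
toy_sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
subset = facility_location_greedy(toy_sim, k=2)
```

With a budget of two items, the greedy loop first picks the most central point of the larger cluster and then a point from the other cluster, since a second pick from the first cluster adds almost nothing to coverage. A 30% budget as reported in the table would correspond to setting `k` to roughly 0.3 times the dataset size.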