Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation
Authors: Tianyi Zhang, Junda Su, Aditya Desai, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive evaluations with Llama and Mistral models demonstrate that SketchTune outperforms leading PEFT methods across diverse tasks while using substantially smaller base models and comparable trainable parameters. As a highlight, SketchTune outperforms LoRA, DoRA, and S2FT on commonsense and math benchmarks using 2.6–3.5× smaller base models and exceeds LoftQ in accuracy by 14.48% on GSM8K with 7.3× fewer trainable parameters. |
| Researcher Affiliation | Collaboration | ¹Rice University, Houston, TX; ²xMAD.ai; ³University of California, Berkeley, Berkeley, CA; ⁴Stevens Institute of Technology, Hoboken, NJ; ⁵ThirdAI Corp.; ⁶Ken Kennedy Institute. |
| Pseudocode | Yes | Algorithm 1 Learning to Sketch LLM Weights |
| Open Source Code | Yes | Our code and model checkpoints are available publicly at https://github.com/LeanModels/SketchTune. |
| Open Datasets | Yes | For math problem-solving, we fine-tune these models on the Math10K dataset and evaluate on 7 different math reasoning datasets (Hu et al., 2023). For commonsense reasoning, we fine-tune on the Commonsense170K dataset and evaluate on 8 different commonsense reasoning datasets (Hu et al., 2023). To compare SketchTune against efficient quantized model fine-tuning methods, we follow the settings in Li et al. (2023b) to fine-tune and test Llama-2 models on the language modeling dataset WikiText-2 (Merity et al., 2022) and the math reasoning dataset GSM8K (Cobbe et al., 2021). |
| Dataset Splits | Yes | The WikiText-2 dataset (Merity et al., 2016) consists of 44.8K examples in total: 36.7K training, 3.76K validation, and 4.36K test. Following LoftQ (Li et al., 2023b), we used the training set to perform fine-tuning, and the validation set to evaluate the fine-tuned model's performance. |
| Hardware Specification | Yes | We sketch each model using a single Quadro RTX 8000-48GB GPU. For model training, we train each model using a single NVIDIA A100-40GB GPU. All experiments are performed on an NVIDIA A100-40GB GPU. |
| Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)" and "Transformers library (Wolf et al., 2020)" but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We optimize SketchTune's hyper-parameters, including learning rate and batch size, through a parameter sweep, and we report the hyper-parameters for training in Appendix I. Appendix I contains tables with hyperparameter selections for fine-tuning SketchTune on various tasks, including LR, Optimizer, Batch Size, Epochs, LR Scheduler, and Warmup Steps. |