Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Authors: Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language. |
| Researcher Affiliation | Collaboration | 1Max Planck Institute for Intelligent Systems, Tübingen; 2University of Cambridge; 3ETH Zürich; 4University of Tübingen; 5Mila, Université de Montréal; 6The Alan Turing Institute |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | boft.wyliu.com |
| Open Datasets | Yes | To evaluate the performance of BOFT on LLM adaptation, we first finetune a pretrained DeBERTaV3-base model [25] on the GLUE benchmark [87]...We use Alpaca [80] as our finetuning dataset and evaluate both zero-shot and few-shot performance on the MMLU dataset [27]...using two challenging benchmarks: GSM8K [11] and MATH [27]...We evaluate the finetuning performance of BOFT on the VTAB-1K benchmark [94]...on a high-quality segmentation dataset, HQSeg-44K [34]...We finetune the pretrained Stable Diffusion [73] |
| Dataset Splits | Yes | Results are presented in Table 1. # Param in the table denotes the total number of effective trainable parameters for each method. We note that OFT [67] with the block size 16 is BOFT(1,16). |
| Hardware Specification | Yes | All runs can be trained on a single NVIDIA A100-SXM4-80GB GPU. |
| Software Dependencies | No | The paper mentions software like "Hugging Face's Diffusers [85]" and "Parameter-Efficient Fine-Tuning (PEFT) [55]" but does not specify their version numbers. |
| Experiment Setup | Yes | For our experiments on the GLUE benchmark [87], we follow the setting of [97] and only tune the learning rate, the multiplicative dropout rate, and the number of training epochs. ... a total number of 30 training epochs, a fixed training batch size of 64, an AdamW optimizer, and a cosine learning rate scheduler with a warmup ratio of 0.1. |
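The Dataset Splits row above notes that OFT [67] with block size 16 corresponds to BOFT(1,16), i.e., a single block-diagonal orthogonal factor. The sketch below illustrates the general idea behind this notation: an orthogonal rotation built as a product of m sparse factors, each assembled from Cayley-parameterized b×b orthogonal blocks. The stride permutations and the function name `boft_like_rotation` are our own simplified stand-ins, not the paper's exact butterfly pattern.

```python
import torch

def cayley(params: torch.Tensor) -> torch.Tensor:
    """Map unconstrained (..., b, b) parameters to orthogonal blocks via the Cayley transform."""
    skew = params - params.transpose(-1, -2)               # skew-symmetric S
    eye = torch.eye(params.shape[-1]).expand_as(skew)
    return torch.linalg.solve(eye + skew, eye - skew)      # (I + S)^{-1} (I - S) is orthogonal

def boft_like_rotation(params: torch.Tensor, d: int, b: int) -> torch.Tensor:
    """Compose m sparse orthogonal factors into a d x d orthogonal matrix.

    params: (m, d // b, b, b) unconstrained block parameters.
    Each factor is a permuted block-diagonal of Cayley-orthogonal b x b blocks;
    with m = 1 this reduces to the plain block-diagonal structure of OFT.
    The stride permutation below is a simplified stand-in for BOFT's butterfly pattern.
    """
    m = params.shape[0]
    R = torch.eye(d)
    for i in range(m):
        blocks = cayley(params[i])                         # (d // b, b, b) orthogonal blocks
        factor = torch.block_diag(*blocks)                 # sparse block-diagonal factor
        stride = 2 ** i                                    # requires d % stride == 0
        perm = torch.arange(d).reshape(-1, stride).t().reshape(-1)
        P = torch.eye(d)[perm]                             # permutation matrix
        R = (P.t() @ factor @ P) @ R                       # accumulate the orthogonal product
    return R

# Orthogonality check: R @ R.T should be close to the identity.
R = boft_like_rotation(torch.randn(2, 4, 16, 16) * 0.01, d=64, b=16)
assert torch.allclose(R @ R.t(), torch.eye(64), atol=1e-4)
```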
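For the GLUE setup quoted in the Experiment Setup row (30 epochs, batch size 64, AdamW, cosine schedule with warmup ratio 0.1), a minimal configuration sketch is shown below, assuming the Hugging Face `transformers` Trainer API (the paper's toolchain mentions PEFT [55], which commonly pairs with it). The learning rate and output path are placeholders, since the paper tunes the learning rate, dropout, and epochs per task.

```python
from transformers import TrainingArguments

# Hyperparameters quoted from the paper's GLUE setup; the Trainer's default
# optimizer is AdamW, matching the quoted configuration.
training_args = TrainingArguments(
    output_dir="./boft-glue",        # placeholder output path
    num_train_epochs=30,             # "a total number of 30 training epochs"
    per_device_train_batch_size=64,  # "a fixed training batch size of 64"
    learning_rate=5e-4,              # placeholder; tuned per GLUE task in the paper
    lr_scheduler_type="cosine",      # cosine learning-rate scheduler
    warmup_ratio=0.1,                # warmup ratio of 0.1
)
```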