3-in-1: 2D Rotary Adaptation for Efficient Finetuning, Efficient Batching and Composability

Authors: Baohao Liao, Christof Monz

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the efficacy of RoAd, we perform comprehensive evaluations on the GLUE benchmark [56], eight commonsense reasoning tasks and four arithmetic reasoning tasks, utilizing RoBERTa [31] and LLaMA [52, 53] (Section 4.1). The results consistently show that RoAd surpasses other PEFT methods while maintaining a significantly reduced scale of trainable parameters (< 0.1%), as depicted in Figure 1.
Researcher Affiliation | Collaboration | Baohao Liao (1,2), Christof Monz (1); 1: Language Technology Lab, University of Amsterdam; 2: eBay Inc., Aachen, Germany
Pseudocode | No | The paper describes the method using mathematical equations and textual explanations, including an overview diagram (Figure 3), but no explicit pseudocode or algorithm blocks are provided. (An illustrative, non-authoritative sketch of a trainable 2D rotation is given after this table.)
Open Source Code | Yes | Code: https://github.com/BaohaoLiao/road
Open Datasets | Yes | To assess the efficacy of RoAd, we perform comprehensive evaluations on the GLUE benchmark [56], eight commonsense reasoning tasks and four arithmetic reasoning tasks, utilizing RoBERTa [31] and LLaMA [52, 53] (Section 4.1).
Dataset Splits | Yes | Unlike many previous works [14, 22, 23, 31, 65] that employ the GLUE development sets for both validation and testing, here we partition the development set into distinct validation and test subsets to mitigate the risk of overfitting. For comprehensive information regarding the split of the development set, the search space of hyperparameters, the optimal hyperparameter configurations, and other details crucial for reproducibility, please see Section C.1. (A generic split example is sketched after this table.)
Hardware Specification | Yes | All of our experiments are conducted on an A100 80GB GPU with the Transformers [59] and PEFT [34] frameworks.
Software Dependencies | No | The paper mentions using the Transformers [59] and PEFT [34] frameworks, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Hyperparameter tuning. We mainly follow the hyperparameter search space of Liao et al. [22] and list it in Table C.2. Notably, we upscale the learning rate by roughly 10x for RoAd, because RoAd prefers a larger learning rate than other PEFT methods, which is also observed by Liu et al. [25] and Wen and Chaudhuri [57], whose adapters also apply multiplication instead of addition. The best hyperparameter settings for each task are listed in Table C.3. (An illustrative optimizer configuration reflecting this learning-rate choice follows after this table.)
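
The "Pseudocode" row notes that the paper presents RoAd only through equations and a diagram. As a rough, assumption-based illustration of what a trainable 2D rotary adaptation could look like, the sketch below applies a learned 2x2 rotation (with a scale) multiplicatively to each pair of hidden dimensions. The class name, the pairing scheme, the parameterization, and the placement inside a model are guesses for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Rotary2DAdapter(nn.Module):
    """Illustrative sketch only: a trainable 2D rotation (with a scale)
    applied multiplicatively to pairs of hidden dimensions. This is an
    assumption-based reconstruction, not the RoAd reference code."""

    def __init__(self, hidden_size: int):
        super().__init__()
        assert hidden_size % 2 == 0, "pairs of dimensions are required"
        # One angle and one scale per 2D pair, initialized to the identity map.
        self.theta = nn.Parameter(torch.zeros(hidden_size // 2))
        self.scale = nn.Parameter(torch.ones(hidden_size // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., hidden_size); (x1, x2) are the coordinates of each 2D pair.
        x1, x2 = x[..., 0::2], x[..., 1::2]
        cos, sin = torch.cos(self.theta), torch.sin(self.theta)
        # Multiplicative update: rotate and rescale each 2D pair.
        y1 = self.scale * (cos * x1 - sin * x2)
        y2 = self.scale * (sin * x1 + cos * x2)
        # Re-interleave the pairs back into the original layout.
        return torch.stack((y1, y2), dim=-1).flatten(-2)
```

A quick sanity check: `Rotary2DAdapter(8)(torch.randn(2, 8))` returns its input unchanged at initialization, since theta = 0 and scale = 1 give the identity map, and the trainable parameters scale linearly with the hidden size, which is in the spirit of the < 0.1% trainable-parameter figure quoted above.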
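The "Dataset Splits" row quotes the paper's decision to partition each GLUE development set into distinct validation and test subsets rather than reusing it for both. The exact partition is documented in the paper's Section C.1; the snippet below is only a generic example of how such a split could be made with the Hugging Face datasets library, where the task (RTE), the 50/50 ratio, and the seed are placeholders rather than the paper's values.

```python
from datasets import load_dataset

# Load one GLUE task's development set (RTE chosen purely as an example).
dev = load_dataset("glue", "rte", split="validation")

# Placeholder 50/50 partition into validation and test subsets; the
# paper's actual split is documented in its Section C.1.
split = dev.train_test_split(test_size=0.5, seed=42)
val_set, test_set = split["train"], split["test"]
print(len(val_set), len(test_set))
```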
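The "Experiment Setup" row reports that RoAd is tuned with a learning rate roughly 10x larger than is typical for additive PEFT methods, since multiplicative adapters prefer larger learning rates. The sketch below illustrates that choice with an optimizer that trains only the adapter parameters; the 3e-3 default and the parameter-name filter are illustrative placeholders, not the paper's tuned hyperparameters (those are listed in its Tables C.2 and C.3).

```python
import torch

def build_adapter_optimizer(model: torch.nn.Module, lr: float = 3e-3):
    """Freeze the backbone and optimize only the rotary-adapter parameters
    with a comparatively large learning rate. The 3e-3 default and the
    "theta"/"scale" name filter are illustrative placeholders."""
    adapter_params = []
    for name, param in model.named_parameters():
        if "theta" in name or "scale" in name:  # assumed adapter parameter names
            param.requires_grad = True
            adapter_params.append(param)
        else:
            param.requires_grad = False  # backbone stays frozen
    return torch.optim.AdamW(adapter_params, lr=lr)
```

Additive adapters such as LoRA are commonly tuned around 1e-4 to 5e-4, so a multiplicative adapter at roughly ten times that range matches the behavior described in the quoted passage; the per-task values actually used are in the paper's Table C.3.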