Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

Authors: Baohao Liao, Shaomu Tan, Christof Monz

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT.
Researcher Affiliation | Academia | Baohao Liao, Shaomu Tan, Christof Monz; Language Technology Lab, University of Amsterdam; {b.liao, s.tan, c.monz}@uva.nl
Pseudocode | Yes | Listing 1: Backward pass for each Layer. The peak memory happens at Line 10 or Line 25, depending on whether the subnetwork G is larger than F or the opposite. In the code, we use x1, x2, y1, y2, x1_factor, x2_factor to represent h^1_{n-1}, h^2_{n-1}, h^1_n, h^2_n, λ and β, respectively. (A sketch of such a backward pass is given after this table.)
Open Source Code | Yes | Code at https://github.com/baohaoliao/mefts. Up-to-date version at https://arxiv.org/abs/2306.00477.
Open Datasets | Yes | We evaluate MEFTs on eight sequence representation tasks and five sequence-to-sequence tasks. All sequence representation tasks are from the GLUE benchmark [25]. The sequence-to-sequence tasks are question-answering benchmarks, including OpenBookQA [44], PIQA [45], ARC (easy and challenge) [46] and SciQ [47]. We show the statistics of these datasets in Table 8 in the Appendix.
Dataset Splits | Yes | If the model's performance on the development set does not improve over 5 epochs, we stop the training.
Hardware Specification | Yes | We run all experiments on the Transformers framework [34] on a single NVIDIA RTX A6000 GPU with 48GB memory.
Software Dependencies | No | The paper mentions using the 'Transformers framework [34]' and 'PyTorch [52]', but it does not specify version numbers for these software components, which is required for reproducibility.
Experiment Setup | Yes | On the GLUE benchmark, we sweep learning rates in {3, 4, 5} × 10⁻⁴, batch sizes in {16, 32} and the number of epochs in {10, 20} for the tasks with >10k training samples. For the low-resource tasks with <10k training samples, we sweep learning rates in {5, 6, 7, 8} × 10⁻⁴, batch sizes in {16, 32} and the number of epochs in {20, 40}. ... For all question-answering tasks, we sweep learning rates in {1, 3, 5, 7} × 10⁻⁴, batch sizes in {8, 16, 32} and the number of epochs in {3, 5, 10}... (These sweep grids are written out after the table.)
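
To connect the Listing 1 caption above to concrete code: the following is a minimal PyTorch sketch of the backward pass of a scaled reversible coupling, assuming the forward form y1 = λ·x1 + F(x2), y2 = β·x2 + G(y1) implied by the variable names in the caption. This is not the paper's Listing 1; the function name and structure are illustrative.

import torch

def reversible_backward(y1, y2, dy1, dy2, F, G, x1_factor, x2_factor):
    """Reconstruct the layer inputs (x1, x2) from the outputs (y1, y2) and
    propagate gradients, so the inputs never have to be stored during the
    forward pass. Assumed forward: y1 = x1_factor*x1 + F(x2),
    y2 = x2_factor*x2 + G(y1)."""
    # Invert the second coupling: x2 = (y2 - G(y1)) / beta.
    with torch.enable_grad():
        y1 = y1.detach().requires_grad_(True)
        g_out = G(y1)
    x2 = (y2 - g_out.detach()) / x2_factor

    # dy2 flows through G back to y1; add it to the gradient arriving at y1.
    # This call also accumulates gradients into G's parameters.
    g_out.backward(dy2)
    dy1_total = dy1 + y1.grad

    # Invert the first coupling: x1 = (y1 - F(x2)) / lambda.
    with torch.enable_grad():
        x2 = x2.detach().requires_grad_(True)
        f_out = F(x2)
    x1 = (y1.detach() - f_out.detach()) / x1_factor

    # dy1_total flows through F back to x2; x2 also feeds y2 directly (scaled by beta).
    # This call also accumulates gradients into F's parameters.
    f_out.backward(dy1_total)
    dx2 = x2_factor * dy2 + x2.grad
    dx1 = x1_factor * dy1_total

    return x1, x2.detach(), dx1, dx2

The memory saving comes from recomputing x1 and x2 from the layer outputs rather than caching them in the forward pass; only the reconstructed activations and the intermediate outputs of F and G (Lines 10 and 25 in the paper's listing) are alive at any point, which is why the larger of the two subnetworks determines the peak.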
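The hyperparameter grids quoted in the Experiment Setup row, rewritten as Python dictionaries purely for readability; the grid names and dictionary structure are illustrative and not taken from the released code.

GLUE_HIGH_RESOURCE = {   # GLUE tasks with >10k training samples
    "learning_rate": [3e-4, 4e-4, 5e-4],
    "batch_size": [16, 32],
    "num_epochs": [10, 20],
}
GLUE_LOW_RESOURCE = {    # GLUE tasks with <10k training samples
    "learning_rate": [5e-4, 6e-4, 7e-4, 8e-4],
    "batch_size": [16, 32],
    "num_epochs": [20, 40],
}
QA_TASKS = {             # OpenBookQA, PIQA, ARC (easy/challenge), SciQ
    "learning_rate": [1e-4, 3e-4, 5e-4, 7e-4],
    "batch_size": [8, 16, 32],
    "num_epochs": [3, 5, 10],
}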