Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
Authors: Baohao Liao, Shaomu Tan, Christof Monz
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT. |
| Researcher Affiliation | Academia | Baohao Liao Shaomu Tan Christof Monz Language Technology Lab, University of Amsterdam EMAIL |
| Pseudocode | Yes | Listing 1: Backward pass for each Layer. The peak memory happens at Line 10 or Line 25, depending on whether the subnetwork G is larger than F or the opposite. In the code, we use x1, x2, y1, y2, x1_factor, x2_factor to represent $h^1_{n-1}$, $h^2_{n-1}$, $h^1_n$, $h^2_n$, $\lambda$ and $\beta$, respectively. |
| Open Source Code | Yes | Code at https://github.com/baohaoliao/mefts. Up-to-date version at https://arxiv.org/abs/2306.00477. |
| Open Datasets | Yes | We evaluate MEFTs on eight sequence representation tasks and five sequence-to-sequence tasks. All sequence representation tasks are from the GLUE benchmark [25]. The sequence-to-sequence tasks are question-answering benchmarks, including OpenBookQA [44], PIQA [45], ARC (easy and challenge) [46] and SciQ [47]. We show the statistics of these datasets in Table 8 in Appendix. |
| Dataset Splits | Yes | If the model's performance on the development set is not improved over 5 epochs, we stop the training. |
| Hardware Specification | Yes | We run all experiments on the Transformers framework [34] on a single NVIDIA RTX A6000 GPU with 48GB memory. |
| Software Dependencies | No | The paper mentions using the 'Transformers framework [34]' and 'PyTorch [52]', but it does not specify version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | On the GLUE benchmark, we sweep learning rates in $\{3, 4, 5\} \times 10^{-4}$, batch sizes in {16, 32} and the number of epochs in {10, 20} for the tasks with >10k training samples. For the low-resource tasks with <10k training samples, we sweep learning rates in $\{5, 6, 7, 8\} \times 10^{-4}$, batch sizes in {16, 32} and the number of epochs in {20, 40}. ... For all question-answering tasks, we sweep learning rates in $\{1, 3, 5, 7\} \times 10^{-4}$, batch sizes in {8, 16, 32} and the number of epochs in {3, 5, 10}... |
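
The sweeps quoted in the Experiment Setup row and the stopping rule quoted in the Dataset Splits row can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the names `SWEEPS`, `expand`, and `should_stop` and the dictionary keys are hypothetical, only the values quoted above are used, and settings elided by "..." in the quotes are not reconstructed. Reading "not improved over 5 epochs" as a patience of 5 epochs is also an assumption.

```python
from itertools import product

# Sweep values exactly as quoted in the table. Learning rates are stored as
# mantissas of 1e-4. All names here are hypothetical, chosen for illustration.
SWEEPS = {
    "glue_high_resource": {"lr_mantissa": [3, 4, 5], "batch_size": [16, 32], "epochs": [10, 20]},
    "glue_low_resource": {"lr_mantissa": [5, 6, 7, 8], "batch_size": [16, 32], "epochs": [20, 40]},
    "question_answering": {"lr_mantissa": [1, 3, 5, 7], "batch_size": [8, 16, 32], "epochs": [3, 5, 10]},
}


def expand(grid):
    """Return every configuration in the Cartesian product of one sweep grid."""
    configs = []
    for mantissa, batch_size, epochs in product(
        grid["lr_mantissa"], grid["batch_size"], grid["epochs"]
    ):
        configs.append(
            {
                "learning_rate": mantissa / 10_000,  # mantissa x 1e-4
                "batch_size": batch_size,
                "epochs": epochs,
            }
        )
    return configs


def should_stop(dev_scores, patience=5):
    """Assumed reading of the stopping rule: stop once the last `patience`
    epochs contain no score above the best score seen before them."""
    if len(dev_scores) <= patience:
        return False
    best_before = max(dev_scores[:-patience])
    return max(dev_scores[-patience:]) <= best_before


if __name__ == "__main__":
    for name, grid in SWEEPS.items():
        print(name, len(expand(grid)), "configurations")
```

Counting the expanded grids gives a rough sense of the tuning budget implied by the quoted ranges, which is useful context when judging how reproducible the reported setup is.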