Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

Authors: Baohao Liao, Shaomu Tan, Christof Monz

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT.
Researcher Affiliation | Academia | Baohao Liao, Shaomu Tan, Christof Monz; Language Technology Lab, University of Amsterdam
Pseudocode | Yes | Listing 1: Backward pass for each Layer. The peak memory happens at Line 10 or Line 25, depending on whether the subnetwork G is larger than F or the opposite. In the code, we use x1, x2, y1, y2, x1_factor, x2_factor to represent h^1_{n-1}, h^2_{n-1}, h^1_n, h^2_n, λ and β, respectively.
Open Source Code | Yes | Code at https://github.com/baohaoliao/mefts. Up-to-date version at https://arxiv.org/abs/2306.00477.
Open Datasets | Yes | We evaluate MEFTs on eight sequence representation tasks and five sequence-to-sequence tasks. All sequence representation tasks are from the GLUE benchmark [25]. The sequence-to-sequence tasks are question-answering benchmarks, including OpenBookQA [44], PIQA [45], ARC (easy and challenge) [46] and SciQ [47]. We show the statistics of these datasets in Table 8 in Appendix.
Dataset Splits | Yes | If the model's performance on the development set is not improved over 5 epochs, we stop the training.
Hardware Specification | Yes | We run all experiments on the Transformers framework [34] on a single NVIDIA RTX A6000 GPU with 48GB memory.
Software Dependencies | No | The paper mentions using the 'Transformers framework [34]' and 'PyTorch [52]', but it does not specify version numbers for these software components, which is required for reproducibility.
Experiment Setup | Yes | On the GLUE benchmark, we sweep learning rates in {3, 4, 5} × 10^-4, batch sizes in {16, 32} and the number of epochs in {10, 20} for the tasks with >10k training samples. For the low-resource tasks with <10k training samples, we sweep learning rates in {5, 6, 7, 8} × 10^-4, batch sizes in {16, 32} and the number of epochs in {20, 40}. ... For all question-answering tasks, we sweep learning rates in {1, 3, 5, 7} × 10^-4, batch sizes in {8, 16, 32} and the number of epochs in {3, 5, 10}...
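
To make the quoted search spaces and the stopping rule concrete, the following is a minimal Python sketch, not the authors' code. It enumerates the three grids quoted in the Experiment Setup row and applies a patience-of-5 check like the one quoted under Dataset Splits. The names `SWEEPS`, `expand_grid` and `should_stop` are hypothetical, the sketch assumes a dev metric where higher is better, and the parts of the setup elided by "..." in the quote are not represented.

```python
from itertools import product

# Search spaces as quoted in the Experiment Setup row.
# Keys and layout are illustrative assumptions, not the authors' configuration format.
SWEEPS = {
    "glue_high_resource": {  # GLUE tasks with >10k training samples
        "learning_rate": [3e-4, 4e-4, 5e-4],
        "batch_size": [16, 32],
        "num_epochs": [10, 20],
    },
    "glue_low_resource": {  # GLUE tasks with <10k training samples
        "learning_rate": [5e-4, 6e-4, 7e-4, 8e-4],
        "batch_size": [16, 32],
        "num_epochs": [20, 40],
    },
    "question_answering": {
        "learning_rate": [1e-4, 3e-4, 5e-4, 7e-4],
        "batch_size": [8, 16, 32],
        "num_epochs": [3, 5, 10],
    },
}


def expand_grid(space):
    """Return every combination of the hyperparameters in `space` as a dict."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def should_stop(dev_scores, patience=5):
    """Return True if the last `patience` epochs did not beat the earlier best.

    Assumes `dev_scores` holds one score per epoch and that higher is better.
    """
    if len(dev_scores) <= patience:
        return False  # not enough history to judge yet
    best_before = max(dev_scores[:-patience])
    return max(dev_scores[-patience:]) <= best_before


if __name__ == "__main__":
    for name, space in SWEEPS.items():
        print(name, len(expand_grid(space)), "configurations")
```

Under these assumptions, a full sweep per task covers 12, 16 and 36 configurations for the three groups respectively, before any early stopping shortens individual runs.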