Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning

Authors: Jing Xu, Jingzhao Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide both empirical and theoretical explorations into the success of Random Masking." and "This section presents the empirical findings of Random Masking."
Researcher Affiliation | Academia | 1 Institute for Interdisciplinary Information Sciences, Tsinghua University, China; 2 Shanghai Qizhi Institute; 3 Shanghai AI Laboratory. Correspondence to: Jing Xu <xujing21@mails.tsinghua.edu.cn>, Jingzhao Zhang <jingzhaoz@mail.tsinghua.edu.cn>.
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code is available at https://github.com/JingXuTHU/Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning.
Open Datasets | Yes | "We conduct the experiments on a diverse range of datasets and tasks, including 8 datasets in the SuperGLUE benchmark (Wang et al., 2019) and three additional datasets."
Dataset Splits | Yes | "In line with the approach in Malladi et al. (2023a), we randomly sample 1000 data points from each dataset's original training split for training, 500 data points for validation, and randomly sample 1000 data points from its original validation split for testing." (A hypothetical sampling sketch appears after the table.)
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or cloud instance specifications) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions using "the spops library" and "the AdamW optimizer" but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We choose the AdamW optimizer with β1 = 0.9, β2 = 0.999, ε = 1e-8. We perform a grid search of the learning rate from {1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6}. We follow the practice of Malladi et al. (2023a) and Dettmers et al. (2023), and use a constant learning rate schedule. The number of training epochs is set to 5. The batch size is set to 8 per GPU." (A hypothetical configuration sketch appears after the table.)
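
The dataset-splitting protocol quoted in the "Dataset Splits" row can be made concrete with a short sketch. The use of the Hugging Face datasets library, the sample_splits helper, and the super_glue/boolq example are illustrative assumptions rather than details from the paper; only the split sizes (1000 training, 500 validation, 1000 test examples) come from the quoted text.

```python
# Hypothetical sketch of the quoted sampling protocol: 1000 training and 500
# validation examples drawn from the original training split, and 1000 test
# examples drawn from the original validation split.
import random
from datasets import load_dataset

def sample_splits(dataset_name: str, subset: str, seed: int = 0):
    raw = load_dataset(dataset_name, subset)
    rng = random.Random(seed)

    # Draw 1500 distinct indices from the original training split:
    # the first 1000 for training, the remaining 500 for validation.
    train_indices = rng.sample(range(len(raw["train"])), 1500)
    train_set = raw["train"].select(train_indices[:1000])
    val_set = raw["train"].select(train_indices[1000:])

    # Draw 1000 indices from the original validation split for testing.
    test_indices = rng.sample(range(len(raw["validation"])), 1000)
    test_set = raw["validation"].select(test_indices)

    return train_set, val_set, test_set

# Example usage with an assumed SuperGLUE task:
train_set, val_set, test_set = sample_splits("super_glue", "boolq")
```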
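
Similarly, the optimizer configuration quoted in the "Experiment Setup" row could look roughly like the sketch below, assuming a PyTorch training loop over a Hugging Face style model. The build_optimizer and train_one_setting helpers are hypothetical; only the AdamW hyperparameters, the constant learning rate schedule, the 5 training epochs, and the per-GPU batch size of 8 come from the quoted setup.

```python
# Minimal sketch of the quoted hyperparameter configuration; the model,
# dataset, and collate function are placeholders.
import torch
from torch.utils.data import DataLoader

LEARNING_RATE_GRID = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]  # grid-searched values
NUM_EPOCHS = 5
BATCH_SIZE_PER_GPU = 8

def build_optimizer(model: torch.nn.Module, lr: float) -> torch.optim.AdamW:
    # AdamW with beta1 = 0.9, beta2 = 0.999, eps = 1e-8, as quoted above.
    return torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)

def train_one_setting(model, train_set, collate_fn, lr, device="cuda"):
    loader = DataLoader(train_set, batch_size=BATCH_SIZE_PER_GPU,
                        shuffle=True, collate_fn=collate_fn)
    optimizer = build_optimizer(model, lr)
    model.train()
    # Constant learning-rate schedule: no scheduler is attached, so the
    # learning rate stays fixed for all epochs.
    for _ in range(NUM_EPOCHS):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # assumes a Hugging Face style model output
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# A grid search would train one copy of the model per value in
# LEARNING_RATE_GRID and keep the one with the best validation score
# (validation loop omitted here).
```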