Random Masking Finds Winning Tickets for Parameter Efficient Fine-tuning
Authors: Jing Xu, Jingzhao Zhang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We provide both empirical and theoretical explorations into the success of Random Masking." and "This section presents the empirical findings of Random Masking." |
| Researcher Affiliation | Academia | ¹Institute for Interdisciplinary Information Sciences, Tsinghua University, China; ²Shanghai Qizhi Institute; ³Shanghai AI Laboratory. Correspondence to: Jing Xu <xujing21@mails.tsinghua.edu.cn>, Jingzhao Zhang <jingzhaoz@mail.tsinghua.edu.cn>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/JingXuTHU/Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning. |
| Open Datasets | Yes | We conduct the experiments on a diverse range of datasets and tasks, including 8 datasets in the SuperGLUE benchmark (Wang et al., 2019) and three additional datasets. |
| Dataset Splits | Yes | In line with the approach in Malladi et al. (2023a), we randomly sample 1000 data points from each dataset’s original training split for training, 500 data points for validation, and randomly sample 1000 data points from its original validation split for testing. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or cloud instance specifications) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions using 'the spops library' and 'the AdamW optimizer' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We choose the AdamW optimizer with β1 = 0.9, β2 = 0.999, ε = 1e-8. We perform a grid search of learning rate from {1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6}. We follow the practice of Malladi et al. (2023a) and Dettmers et al. (2023), and use a constant learning rate schedule. The number of training epochs is set to 5. The batch size is set to 8 per GPU. |
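
As a concrete reading of the Dataset Splits row above, here is a minimal sketch of the described sampling procedure (1000 training and 500 validation examples drawn from each dataset's original training split, and 1000 test examples drawn from its original validation split). The use of the Hugging Face `datasets` API and the `super_glue`/`rte` task names are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the few-shot sampling described in the paper:
# 1000 train / 500 validation examples from the original training split,
# and 1000 test examples from the original validation split.
# The Hugging Face `datasets` API and task names are assumptions.
from datasets import load_dataset

def sample_splits(task_name: str = "super_glue", subset: str = "rte", seed: int = 0):
    raw = load_dataset(task_name, subset)

    # Shuffle the original training split, then carve out train/validation subsets.
    shuffled_train = raw["train"].shuffle(seed=seed)
    train_set = shuffled_train.select(range(1000))
    val_set = shuffled_train.select(range(1000, 1500))

    # Sample the test set from the original validation split.
    test_set = raw["validation"].shuffle(seed=seed).select(range(1000))
    return train_set, val_set, test_set
```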
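
Similarly, the Experiment Setup row can be read as the following hedged PyTorch sketch. The AdamW hyperparameters, the learning-rate grid, the constant schedule, the 5 epochs, and the per-GPU batch size of 8 come from the quoted text; the model, data loader, and loss computation are placeholder assumptions rather than the authors' implementation.

```python
# Hedged sketch of the reported training configuration.
# betas, eps, the learning-rate grid, epochs, and batch size come from the paper;
# `model`, `train_set`, and the loss computation are placeholders.
import torch
from torch.utils.data import DataLoader

LEARNING_RATE_GRID = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
NUM_EPOCHS = 5
BATCH_SIZE_PER_GPU = 8

def train_one_setting(model, train_set, lr, collate_fn=None):
    loader = DataLoader(train_set, batch_size=BATCH_SIZE_PER_GPU,
                        shuffle=True, collate_fn=collate_fn)
    # AdamW with the reported hyperparameters; no scheduler is attached,
    # matching the constant learning rate schedule.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), eps=1e-8)
    for _ in range(NUM_EPOCHS):
        for batch in loader:
            loss = model(**batch).loss  # assumes a Hugging Face-style model output
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# The grid search would train one model per learning rate in LEARNING_RATE_GRID
# and keep the best setting by validation performance (validation loop omitted).
```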