NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning
Authors: Tianxin Wei, Zeming Guo, Yifan Chen, Jingrui He
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments of PLM fine-tuning on both natural language understanding (NLU) and generation (NLG) tasks are provided to verify the effectiveness of the proposed method MLP fusion. |
| Researcher Affiliation | Academia | 1University of Illinois Urbana-Champaign 2Hong Kong Baptist University. Correspondence to: Yifan Chen <yifanc@comp.hkbu.edu.hk>, Jingrui He <jingrui@illinois.edu>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in narrative text and mathematical equations. |
| Open Source Code | Yes | Our code is available at https://github.com/weitianxin/MLP_Fusion. |
| Open Datasets | Yes | The Stanford Sentiment Treebank (Socher et al., 2013, SST2) is a corpus... The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018, MNLI) consists of enormous sentence pairs... The WebNLG dataset is composed of data/text pairs... |
| Dataset Splits | Yes | In SST2, the sequence lengths are on average 13.3 and max 66. 67k sentences are incorporated into the training set and 0.9k into the dev set. [...] MNLI...there are 393k pairs in the training set, 10k in the dev set. (A split-checking sketch follows the table.) |
| Hardware Specification | Yes | The experiments are all conducted on one Tesla V100 32 GB GPU. |
| Software Dependencies | No | The paper states: "All the models in this work are implemented by PyTorch." However, it does not specify a version number for PyTorch or for any other software dependencies, which is needed for reproducibility. |
| Experiment Setup | Yes | For NLU tasks, we fine-tune RoBERTa (Liu et al., 2019) with an AdamW (Loshchilov & Hutter, 2018) optimizer and use a polynomial learning rate scheduler so that the learning rate decays linearly; concretely, the learning rate is linearly warmed up from 0 for the first 0.06 epoch. The learning rate is searched in the range {1e-5, 2e-5, 4e-5, 6e-5, 8e-5}, and the batch size is fixed at 32. For NLG tasks, we keep using the AdamW optimizer to fine-tune GPT-2 (Radford et al., 2019), and a linear learning rate scheduler with a 500-step warmup duration is used. The learning rate is tuned over the same range as above while the batch size is fixed to 8. By default, all the compared methods reduce the MLP intermediate size from 3072 to 768 or to a comparable number of parameters. The reduction/sketching is performed on the last 8 layers of the PLM by default. (A training-setup sketch follows the table.) |
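The SST-2 and MNLI split sizes quoted in the Dataset Splits row can be spot-checked with a few lines of Python. This is a minimal sketch, assuming the GLUE copies of the datasets hosted on the Hugging Face Hub; the paper does not state which distribution of the data the authors actually used.

```python
# Minimal sketch (not from the paper): verify the reported SST-2 and MNLI splits.
# Assumes the GLUE versions hosted on the Hugging Face Hub.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(len(sst2["train"]), len(sst2["validation"]))          # ~67k train, ~0.9k dev

mnli = load_dataset("glue", "mnli")
print(len(mnli["train"]), len(mnli["validation_matched"]))  # ~393k train, ~10k dev (matched)
```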
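The optimizer and scheduler described in the Experiment Setup row can be approximated as below. This is a minimal sketch, not the authors' released script: the checkpoint name ("roberta-base"), epoch count, dataset size, and the use of `get_polynomial_decay_schedule_with_warmup` with `power=1.0` (i.e., linear decay) are assumptions; the batch size of 32 and the learning-rate grid follow the quoted setup.

```python
# Minimal sketch of the reported NLU fine-tuning setup: RoBERTa + AdamW,
# linear warmup over the first 0.06 epoch, then polynomial (here linear) decay.
# Checkpoint name, epoch count, and dataset size are placeholders, not from the paper.
import torch
from transformers import RobertaForSequenceClassification, get_polynomial_decay_schedule_with_warmup

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

batch_size, num_epochs, train_size = 32, 10, 67_000        # batch size as reported; rest assumed
steps_per_epoch = train_size // batch_size
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.06 * steps_per_epoch)                 # "warmed up from 0 for the first 0.06 epoch"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr searched in {1e-5, ..., 8e-5}
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
    power=1.0,                                             # power=1.0 makes the decay linear
)

for step in range(total_steps):
    # ... forward pass and loss.backward() on a batch go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```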