NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning
Authors: Tianxin Wei, Zeming Guo, Yifan Chen, Jingrui He
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments of PLM fine-tuning on both natural language understanding (NLU) and generation (NLG) tasks are provided to verify the effectiveness of the proposed method MLP fusion. |
| Researcher Affiliation | Academia | 1University of Illinois Urbana-Champaign 2Hong Kong Baptist University. Correspondence to: Yifan Chen <yifanc@comp.hkbu.edu.hk>, Jingrui He <jingrui@illinois.edu>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in narrative text and mathematical equations. |
| Open Source Code | Yes | Our code is available at https://github.com/weitianxin/MLP_Fusion. |
| Open Datasets | Yes | The Stanford Sentiment Treebank (Socher et al., 2013, SST2) is a corpus... The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018, MNLI) consists of enormous sentence pairs... The WebNLG dataset is composed of data/text pairs... |
| Dataset Splits | Yes | In SST2, the sequence lengths are on average 13.3 and max 66. 67k sentences are incorporated into the training set and 0.9k into the dev set. [...] MNLI...there are 393k pairs in the training set, 10k in the dev set. (A split-checking sketch follows the table.) |
| Hardware Specification | Yes | The experiments are all conducted on one Tesla V100 32 GB GPU. |
| Software Dependencies | No | The paper states: "All the models in this work are implemented by PyTorch." However, it does not specify a version number for PyTorch or for any other software dependencies, which is needed for reproducibility. |
| Experiment Setup | Yes | For NLU tasks, we fine-tune RoBERTa (Liu et al., 2019) with an AdamW (Loshchilov & Hutter, 2018) optimizer and use a polynomial learning rate scheduler so that the learning rate decays linearly; concretely, the learning rate is linearly warmed up from 0 for the first 0.06 epoch. The learning rate is searched in the range {1e-5, 2e-5, 4e-5, 6e-5, 8e-5}, and the batch size is fixed at 32. For NLG tasks, we keep using the AdamW optimizer to fine-tune GPT-2 (Radford et al., 2019), and a linear learning rate scheduler with a 500-step warmup duration is used. The learning rate is tuned over the same range as above while the batch size is fixed to 8. By default, all the compared methods reduce the MLP intermediate size from 3072 to 768 or to a comparable number of parameters. The reduction/sketching is performed on the last 8 layers of the PLM by default. (A training-setup sketch follows the table.) |
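The SST-2 and MNLI split sizes quoted in the Dataset Splits row can be spot-checked with a few lines of Python. This is a minimal sketch, assuming the GLUE copies of the datasets hosted on the Hugging Face Hub; the paper does not state which distribution of the data the authors actually used.

```python
# Minimal sketch (not from the paper): verify the reported SST-2 and MNLI splits.
# Assumes the GLUE versions hosted on the Hugging Face Hub.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(len(sst2["train"]), len(sst2["validation"]))          # ~67k train, ~0.9k dev

mnli = load_dataset("glue", "mnli")
print(len(mnli["train"]), len(mnli["validation_matched"]))  # ~393k train, ~10k dev (matched)
```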
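The optimizer and scheduler described in the Experiment Setup row can be approximated as below. This is a minimal sketch, not the authors' released script: the checkpoint name ("roberta-base"), epoch count, dataset size, and the use of `get_polynomial_decay_schedule_with_warmup` with `power=1.0` (i.e., linear decay) are assumptions; the batch size of 32 and the learning-rate grid follow the quoted setup.

```python
# Minimal sketch of the reported NLU fine-tuning setup: RoBERTa + AdamW,
# linear warmup over the first 0.06 epoch, then polynomial (here linear) decay.
# Checkpoint name, epoch count, and dataset size are placeholders, not from the paper.
import torch
from transformers import RobertaForSequenceClassification, get_polynomial_decay_schedule_with_warmup

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

batch_size, num_epochs, train_size = 32, 10, 67_000        # batch size as reported; rest assumed
steps_per_epoch = train_size // batch_size
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.06 * steps_per_epoch)                 # "warmed up from 0 for the first 0.06 epoch"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr searched in {1e-5, ..., 8e-5}
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
    power=1.0,                                             # power=1.0 makes the decay linear
)

for step in range(total_steps):
    # ... forward pass and loss.backward() on a batch go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```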