A Kernel-Based View of Language Model Fine-Tuning

Authors: Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning.
Researcher Affiliation | Academia | Department of Computer Science, Princeton University, Princeton, NJ, USA.
Pseudocode | No | The paper does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Our code and pre-computed kernels are publicly available at https://github.com/princeton-nlp/LM-Kernel-FT.
Open Datasets | Yes | We consider 14 NLP tasks, divided into 8 single-sentence and 6 sentence-pair datasets, which cover: sentiment analysis (SST-2, SST-5, MR, CR); classifying an opinion's polarity (MPQA), subjectivity (Subj), question type (TREC), or news topic (AG News); natural language inference (MNLI, SNLI, QNLI, RTE); and paraphrase detection (MRPC, QQP). For each task, we randomly sample 5 k-shot datasets with k training examples for each label.
Dataset Splits | Yes | To generate the k-shot few-shot datasets, the original training data is used to randomly sample k examples per label for training and another, separate k examples per label for the validation set (see the sampling sketch below the table).
Hardware Specification | No | The paper mentions using a "pre-trained RoBERTa-base (Liu et al., 2020b)", which is a model, but does not specify any hardware (e.g., GPU or CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper states: "We use functorch (He & Zou, 2021) to compute the eNTK for RoBERTa-base" (see the eNTK sketch below the table). While it names a software package (functorch) and cites its paper, it does not provide a specific version number for functorch itself.
Experiment Setup | Yes | We use value ranges given by Gao et al. (2021) and Hu et al. (2021), and search over a wider range of values for SGD. Table 4 shows the hyperparameter grids for fine-tuning and the kernel method (see the grid-search sketch below the table).
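The Dataset Splits row describes the few-shot construction: for each task, k examples per label are drawn from the original training data for training and another disjoint k per label for validation, repeated for 5 random draws. Below is a minimal sketch of that sampling procedure, assuming a generic list of (text, label) pairs and a hypothetical sample_k_shot helper; it is not code from the authors' repository, and the value k = 16 is only illustrative.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed):
    """Sample k train and k validation examples per label from the original training data.

    `examples` is a list of (text, label) pairs; the two splits are disjoint by construction.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    train, dev = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        train.extend(items[:k])       # k examples per label for training
        dev.extend(items[k:2 * k])    # another, separate k per label for validation
    return train, dev

# Toy stand-in for a task's original training data: (text, label) pairs.
original_train_data = [(f"example {i}", i % 2) for i in range(200)]

# Five k-shot datasets per task, each from a different random seed (k here is illustrative).
splits = [sample_k_shot(original_train_data, k=16, seed=s) for s in range(5)]
```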
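The Software Dependencies row quotes the use of functorch to compute the empirical NTK (eNTK). The sketch below shows the generic functorch eNTK recipe (per-example Jacobians contracted over all parameters), using a tiny MLP as a stand-in for RoBERTa-base; the model, shapes, and variable names are assumptions for illustration, not the authors' pipeline, which additionally handles prompting and multi-class outputs.

```python
import torch
from functorch import make_functional, vmap, jacrev

# Tiny stand-in network; the paper computes the eNTK of prompted RoBERTa-base.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.Tanh(),
    torch.nn.Linear(16, 1),
)
fmodel, params = make_functional(model)

def f(params, x):
    # Single-example forward pass returning the 1-dimensional model output.
    return fmodel(params, x.unsqueeze(0)).squeeze(0)

# Per-example Jacobians of the output with respect to all parameters.
jac_fn = vmap(jacrev(f), in_dims=(None, 0))

x_train = torch.randn(4, 8)
x_test = torch.randn(3, 8)
jac_train = jac_fn(params, x_train)  # tuple of tensors, one per parameter group
jac_test = jac_fn(params, x_test)

# eNTK(i, j) = sum over parameters of <grad_theta f(x_i), grad_theta f(x_j)>
kernel = sum(
    jt.flatten(1) @ jr.flatten(1).T
    for jt, jr in zip(jac_test, jac_train)
)
print(kernel.shape)  # torch.Size([3, 4]): one kernel value per (test, train) pair
```

In recent PyTorch releases the same functionality lives in torch.func, but since the quoted sentence names functorch, that is what the sketch imports. The paper's kernel method then uses such a matrix to solve the downstream task; that solver step is not shown here.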
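The Experiment Setup row points to the hyperparameter grids in Table 4 of the paper without reproducing them. The sketch below only illustrates the kind of per-dataset grid search the row describes; the learning rates, batch sizes, and the run_finetuning helper are hypothetical placeholders, not the values from Table 4 or the authors' code.

```python
from itertools import product
import random

def run_finetuning(lr, batch_size, seed):
    # Hypothetical stand-in: a real run would fine-tune on the k-shot training
    # split and return accuracy on the k-shot validation split.
    return random.Random(hash((lr, batch_size, seed))).random()

# Placeholder grids for illustration only; the actual ranges are in Table 4.
learning_rates = [1e-5, 1e-4, 1e-3]
batch_sizes = [2, 4, 8]

results = {}
for seed in range(5):  # one grid search per sampled k-shot dataset
    best = max(
        ((run_finetuning(lr, bs, seed), {"lr": lr, "batch_size": bs})
         for lr, bs in product(learning_rates, batch_sizes)),
        key=lambda t: t[0],
    )
    results[seed] = best

for seed, (val_acc, config) in results.items():
    print(seed, round(val_acc, 3), config)
```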