A Kernel-Based View of Language Model Fine-Tuning
Authors: Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. (A minimal prompting sketch follows the table.) |
| Researcher Affiliation | Academia | Department of Computer Science, Princeton University, Princeton, NJ, USA. |
| Pseudocode | No | The paper does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Our code and pre-computed kernels are publicly available at https://github.com/princeton-nlp/LM-Kernel-FT. |
| Open Datasets | Yes | We consider 14 NLP tasks, divided into 8 single-sentence and 6 sentence-pair datasets, which cover: sentiment analysis (SST-2, SST-5, MR, CR); classifying an opinion's polarity (MPQA), subjectivity (Subj), question type (TREC), or news topic (AG News); natural language inference (MNLI, SNLI, QNLI, RTE); and paraphrase detection (MRPC, QQP). For each task, we randomly sample five k-shot datasets with k training examples for each label. |
| Dataset Splits | Yes | To generate the few-shot (k-shot) datasets, the original training data is used to randomly sample k examples per label for training and another, separate k examples per label for the validation set. (A sketch of this sampling procedure follows the table.) |
| Hardware Specification | No | The paper mentions using a "pre-trained RoBERTa-base (Liu et al., 2020b)", which is a model, but does not specify any hardware (e.g., GPU or CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper states: "We use functorch (He & Zou, 2021) to compute the eNTK for RoBERTa-base". While it names a software package (functorch) and cites its paper, it does not provide a specific version number for functorch itself. (A sketch of an eNTK computation with functorch follows the table.) |
| Experiment Setup | Yes | We use value ranges given by (Gao et al., 2021) and (Hu et al., 2021), and search over a wider range of values for SGD. Table 4 shows the hyperparameter grids for fine-tuning and the kernel method. |
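
The Research Type row quotes the paper's claim that casting the downstream task as masked word prediction through prompting often induces kernel-based dynamics. Below is a minimal sketch of that prompt-based formulation using Hugging Face `transformers`; the SST-2-style template ("It was [MASK].") and the label words "great"/"terrible" are illustrative assumptions, not necessarily the exact prompts used in the paper.

```python
# Sketch: classify a sentence by scoring label words at a [MASK] position.
# Template and verbalizer below are illustrative, not the paper's exact choices.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

sentence = "a gripping, beautifully shot film ."
prompt = f"{sentence} It was {tokenizer.mask_token}."          # pattern
label_words = {"positive": " great", "negative": " terrible"}  # verbalizer

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Classification reduces to comparing the MLM logits of the label words.
scores = {lab: logits[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(w))[0]].item()
          for lab, w in label_words.items()}
print(max(scores, key=scores.get))
```

Scoring label words at the mask position is what lets a pre-trained masked language model act as a few-shot classifier without adding a new output head.

The Dataset Splits row describes the k-shot sampling protocol. Below is a minimal sketch of that procedure, assuming a hypothetical `load_examples()` helper that yields `(text, label)` pairs; the actual task loaders and random seeds are in the authors' released code.

```python
# Sketch: sample k training and k validation examples per label from the
# original training data, repeated for several random seeds.
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed):
    """examples: list of (text, label) pairs from the original training data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    train, dev = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        assert len(group) >= 2 * k, f"not enough examples for label {label}"
        train.extend(group[:k])       # k training examples per label
        dev.extend(group[k:2 * k])    # a separate k validation examples per label
    return train, dev

# Five few-shot splits with different seeds, as in the paper's protocol
# (load_examples() is a hypothetical data-loading helper):
# splits = [sample_k_shot(load_examples("sst-2"), k=16, seed=s) for s in range(5)]
```

The Software Dependencies row notes that the empirical NTK (eNTK) is computed with functorch. The sketch below uses a standard functorch-style per-example Jacobian contraction on a toy MLP rather than RoBERTa-base; the authors' repository contains the paper's actual kernel computation.

```python
# Sketch: empirical NTK via functorch. The tiny MLP and random inputs are
# placeholders for illustration, not the paper's RoBERTa-base pipeline.
import torch
import torch.nn as nn
from functorch import make_functional, vmap, jacrev

net = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 2))
fnet, params = make_functional(net)

def fnet_single(params, x):
    # Evaluate the network on a single example (add, then drop, the batch dim).
    return fnet(params, x.unsqueeze(0)).squeeze(0)

def empirical_ntk(params, x1, x2):
    # Per-example Jacobians w.r.t. all parameters, flattened per parameter tensor.
    jac1 = [j.flatten(2) for j in vmap(jacrev(fnet_single), (None, 0))(params, x1)]
    jac2 = [j.flatten(2) for j in vmap(jacrev(fnet_single), (None, 0))(params, x2)]
    # K[i, j, a, b] = <d f_a(x1_i)/d theta, d f_b(x2_j)/d theta>, summed over tensors.
    blocks = torch.stack([torch.einsum('Naf,Mbf->NMab', j1, j2)
                          for j1, j2 in zip(jac1, jac2)])
    return blocks.sum(0)

x_train, x_test = torch.randn(8, 16), torch.randn(4, 16)
K = empirical_ntk(params, x_test, x_train)
print(K.shape)  # torch.Size([4, 8, 2, 2])
```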