Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Kernel-Based View of Language Model Fine-Tuning
Authors: Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. |
| Researcher Affiliation | Academia | Department of Computer Science, Princeton University, Princeton, NJ, USA. |
| Pseudocode | No | The paper does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Our code and pre-computed kernels are publicly available at https://github.com/princeton-nlp/LM-Kernel-FT. |
| Open Datasets | Yes | We consider 14 NLP tasks, divided into 8 single sentence and 6 sentence pair datasets, which cover: sentiment analysis (SST-2, SST-5, MR, CR); classifying an opinion's polarity (MPQA) or subjectivity (Subj) or question type (TREC) or news topic (AG News); natural language inference (MNLI, SNLI, QNLI, RTE); and paraphrase detection tasks (MRPC, QQP). For each task, we randomly sample 5 k-shot datasets with k training examples for each label. |
| Dataset Splits | Yes | To generate k-shot few-shot datasets, the original training data is used to randomly sample k examples per label for training and another, separate k examples per label for the validation set. |
| Hardware Specification | No | The paper mentions using a "pre-trained RoBERTa-base (Liu et al., 2020b)", which is a model, but does not specify any hardware (e.g., GPU, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper states: "We use functorch (He & Zou, 2021) to compute the eNTK for RoBERTa-base". While it names a software package (functorch) and cites it, it does not provide a specific version number for functorch. |
| Experiment Setup | Yes | We use value ranges given by (Gao et al., 2021) and (Hu et al., 2021), and search over a wider range of values for SGD. Table 4 shows the hyperparameter grids for fine-tuning and the kernel method. |
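
The k-shot split procedure quoted in the Dataset Splits row (k examples per label for training, plus a separate k per label for validation, repeated for 5 sampled datasets) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `sample_k_shot`, the `(text, label)` data layout, the toy data, the seeds, and the choice of `k=16` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_k_shot(examples, k, seed=0):
    """Return (train, validation), each with k examples per label, disjoint.

    `examples` is assumed to be a list of (text, label) pairs; this layout
    and the function name are illustrative, not taken from the paper's code.
    """
    rng = random.Random(seed)

    # Group example indices by label.
    by_label = defaultdict(list)
    for index, (_, label) in enumerate(examples):
        by_label[label].append(index)

    train, validation = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        if len(indices) < 2 * k:
            raise ValueError(f"label {label!r} has fewer than {2 * k} examples")
        # Draw 2k distinct indices, then split them so the sets cannot overlap.
        chosen = rng.sample(indices, 2 * k)
        train.extend(examples[i] for i in chosen[:k])
        validation.extend(examples[i] for i in chosen[k:])
    return train, validation


# Toy data standing in for an original training set (hypothetical).
toy = [(f"sentence {i}", i % 2) for i in range(100)]

# Five sampled k-shot datasets, one per seed; seeds and k=16 are assumptions.
splits = [sample_k_shot(toy, k=16, seed=s) for s in range(5)]
```

Drawing 2k indices per label in a single call and splitting them is one simple way to guarantee that the training and validation sets never share an example; the released repository may implement this differently.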