Task-Specific Skill Localization in Fine-tuned Language Models

Authors: Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, Sanjeev Arora

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments suggest that localization via grafting can assist certain forms of continual learning. Our code is available at Skill-Localization-by-grafting.
Researcher Affiliation | Collaboration | *Equal contribution. 1 Department of Computer Science, Princeton University. Correspondence to: Abhishek Panigrahi <ap34@princeton.edu>, Nikunj Saunshi <nsaunshi@google.com>.
Pseudocode | No | The paper describes its optimization procedure and other methods in prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at Skill-Localization-by-grafting (https://github.com/abhishekpanigrahi1996/Skill-Localization-by-grafting).
Open Datasets | Yes | We fine-tuned the pre-trained RoBERTa-base (Liu et al., 2019b) model on 13 different tasks, with the majority from GLUE (Wang et al., 2018), including sentiment analysis, topic classification, natural language inference, and paraphrase detection datasets.
Dataset Splits | Yes | We make a random 95%/5% split of the training set to have a validation set for hyperparameter tuning. (A hedged code sketch of this split appears after the table.)
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as specific GPU models, CPU types, or cloud computing instances with detailed specifications.
Software Dependencies | No | The paper mentions software components like "RoBERTa-base", "GPT-2", "SGD optimizer", and "AdamW" but does not provide specific version numbers for these or any other libraries or dependencies, which are necessary for reproducible software setup.
Experiment Setup | Yes | For SGD, we follow the grid {2, 4, 8} for batch size and {10^-2, 5×10^-3, 10^-3} for learning rate and apply a small weight decay of 10^-4 on all the model parameters during training. Model grafting experiments optimize Equation (3) using SGD with batch size 1024 (full-batch GD for 64-shot) for 100 steps with learning rate 10^7. (A schematic sketch of this setup appears after the table.)
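
As a concrete illustration of the 95%/5% split noted in the Dataset Splits row, the sketch below uses the Hugging Face datasets library. The task name, seed, and API choice are assumptions for illustration, not details taken from the paper or its released code.

# Hedged sketch (not the authors' code): hold out 5% of a GLUE training set
# as a validation split for hyperparameter tuning, as described in the paper.
from datasets import load_dataset

raw = load_dataset("glue", "sst2")                      # task choice is illustrative
split = raw["train"].train_test_split(test_size=0.05,   # random 95%/5% split
                                      seed=42)          # seed is an assumption
train_set = split["train"]                              # 95%: used for fine-tuning
val_set = split["test"]                                 # 5%: used for hyperparameter tuning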
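To make the Experiment Setup row easier to parse, here is a schematic Python sketch of the quoted hyperparameter grid and a grafting-style mask optimization. The sigmoid mask parametrization, the task_loss placeholder, and the binarization threshold are assumptions for illustration; this is not the paper's Equation (3) verbatim.

import itertools
import torch

# Fine-tuning grid quoted in the Experiment Setup row
BATCH_SIZES = [2, 4, 8]
LEARNING_RATES = [1e-2, 5e-3, 1e-3]
WEIGHT_DECAY = 1e-4

def graft_mask(theta_pre, theta_ft, data_loader, task_loss, steps=100, lr=1e7):
    """Toy grafting-style optimization: learn a soft mask gamma that decides
    which fine-tuned parameters to graft onto the pre-trained model.
    theta_pre / theta_ft are flattened parameter vectors; task_loss is a
    placeholder that evaluates the interpolated parameters on a batch."""
    s = torch.zeros_like(theta_pre, requires_grad=True)    # mask logits
    opt = torch.optim.SGD([s], lr=lr)                       # large lr as in the quoted setup
    batches = itertools.cycle(data_loader)                  # batch size 1024 per the quote
    for _ in range(steps):                                  # 100 optimization steps
        x, y = next(batches)
        gamma = torch.sigmoid(s)                            # soft mask in (0, 1)
        theta = theta_pre + gamma * (theta_ft - theta_pre)  # graft masked parameters
        loss = task_loss(theta, x, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.sigmoid(s) > 0.5).float()                 # binarized graft region

In a full run, one would loop over itertools.product(BATCH_SIZES, LEARNING_RATES) for fine-tuning and select the configuration that performs best on the 5% validation split described above.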