Task-Specific Skill Localization in Fine-tuned Language Models
Authors: Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, Sanjeev Arora
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments suggest that localization via grafting can assist certain forms of continual learning. Our code is available at Skill-Localization-by-grafting. |
| Researcher Affiliation | Collaboration | *Equal contribution. Department of Computer Science, Princeton University. Correspondence to: Abhishek Panigrahi <ap34@princeton.edu>, Nikunj Saunshi <nsaunshi@google.com>. |
| Pseudocode | No | The paper describes its optimization procedure and other methods in prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at Skill-Localization-by-grafting: https://github.com/abhishekpanigrahi1996/Skill-Localization-by-grafting |
| Open Datasets | Yes | We fine-tuned the pre-trained RoBERTa-base (Liu et al., 2019b) model on 13 different tasks, with the majority from GLUE (Wang et al., 2018), including sentiment analysis, topic classification, natural language inference, and paraphrase detection datasets. |
| Dataset Splits | Yes | We make a random 95%/5% split of the training set to have a validation set for hyperparameter tuning. (A reproduction sketch of this split follows the table.) |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as specific GPU models, CPU types, or cloud computing instances with detailed specifications. |
| Software Dependencies | No | The paper mentions software components like "RoBERTa-base", "GPT-2", "SGD optimizer", and "AdamW" but does not provide specific version numbers for these or any other libraries or dependencies, which are necessary for a reproducible software setup. |
| Experiment Setup | Yes | For SGD, we follow the grid {2, 4, 8} for batch size and {10^-2, 5×10^-3, 10^-3} for learning rate and apply a small weight decay of 10^-4 on all the model parameters during training. Model grafting experiments optimize Equation (3) using SGD with batch size 1024 (full-batch GD for 64-shot) for 100 steps with learning rate 10^7. (A hedged sketch of this grafting optimization follows the table.) |
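
The 95%/5% split reported in the Dataset Splits row is straightforward to reproduce. Below is a minimal sketch, assuming the Hugging Face `datasets` library and SST-2 from GLUE as a stand-in task; the task choice and seed are illustrative, not taken from the paper.

```python
# Hedged sketch: random 95%/5% train/validation split for hyperparameter tuning.
# Assumes the Hugging Face `datasets` library; SST-2 and the seed are illustrative choices.
from datasets import load_dataset

train = load_dataset("glue", "sst2", split="train")
split = train.train_test_split(test_size=0.05, seed=42)  # 95% train, 5% validation
train_set, val_set = split["train"], split["test"]
```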
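
The Experiment Setup row quotes only the hyperparameters for optimizing Equation (3), the grafting objective that selects which fine-tuned parameters to graft back onto the pretrained model. The sketch below is a hedged reconstruction of that mask optimization, assuming a sigmoid-relaxed mask, a generic `task_loss` callable, and a simple sparsity penalty; those parameterization details and all helper names are assumptions, not the authors' exact implementation. Only the step count and learning rate come from the quoted setup.

```python
# Hedged sketch of grafting-mask optimization (Equation (3) in the paper).
# theta_pre / theta_ft: dicts mapping parameter names to pretrained and fine-tuned tensors.
# task_loss(params, batch): callable that evaluates the task loss with the given parameters
# (e.g., via torch.func.functional_call). The sigmoid relaxation and sparsity penalty are
# assumptions; only the 100 steps and the learning rate mirror the quoted setup.
import itertools
import torch

def optimize_graft_mask(theta_pre, theta_ft, task_loss, data_loader,
                        steps=100, lr=1e7, sparsity_weight=1e-4):
    # One unconstrained score per parameter entry; sigmoid(score) acts as a soft mask gamma.
    scores = {name: torch.zeros_like(p, requires_grad=True)
              for name, p in theta_ft.items()}
    opt = torch.optim.SGD(list(scores.values()), lr=lr)

    for batch in itertools.islice(itertools.cycle(data_loader), steps):
        grafted, mask_size = {}, 0.0
        for name in theta_ft:
            gamma = torch.sigmoid(scores[name])
            # Grafted parameters: fine-tuned values where the mask is on, pretrained elsewhere.
            grafted[name] = gamma * theta_ft[name] + (1 - gamma) * theta_pre[name]
            mask_size = mask_size + gamma.sum()

        loss = task_loss(grafted, batch) + sparsity_weight * mask_size
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Binarize: keep the fine-tuned value only where the soft mask is confidently on.
    return {name: (torch.sigmoid(s) > 0.5) for name, s in scores.items()}
```

In the paper's setup this would be run with batch size 1024 (or full-batch gradient descent in the 64-shot regime); here the effective batch size is whatever `data_loader` yields.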