Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Authors: Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings. |
| Researcher Affiliation | Collaboration | (1) University of Cambridge, UK; (2) University College London, UK; (3) EECS Department, University of Michigan, Ann Arbor, MI, USA; (4) Center for Brain Science, Harvard University, Cambridge, MA, USA; (5) Physics & Informatics Laboratories, NTT Research, Inc., Sunnyvale, CA, USA |
| Pseudocode | Yes | Algorithm 1: Pseudocode for compiling the Counter capability via Tracr (RASP code used to generate the model for the Counter capability and task). Algorithm 2: Pseudocode for compiling the Max Identifier capability via Tracr (RASP code used to generate the model for the Max Identifier capability and task). An illustrative RASP/Tracr sketch follows this table. |
| Open Source Code | No | The paper states 'Our code is based on this repository: https://github.com/karpathy/llama2.c' in Appendix F.1 regarding TinyStories, which points to an existing open-source project. It does not provide an explicit statement or link for source code specific to the paper's own modifications or the general methodology it describes. |
| Open Datasets | Yes | Specifically, we focus on the following two setups: (i) compiled transformer models based on the Tracr library (Lindner et al., 2023; Weiss et al., 2021), which allows encoding specific computational programs into a transformer, and (ii) procedurally generated setups involving probabilistic context-free grammars (PCFGs) (Sipser, 1996; Chomsky, 1956)...Additionally, we perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup. For the TinyStories results, we use the TinyStories-Instruct variant of the dataset (Eldan & Li, 2023) (see App. B.3 for an example). These models are able to follow specific instructions to write coherent English stories over multiple paragraphs. This data consists of children's stories written by GPT-3.5 and GPT-4. We use the TinyStories-Instruct version of this dataset: https://huggingface.co/datasets/roneneldan/TinyStoriesInstruct (a loading sketch follows this table). |
| Dataset Splits | No | The paper mentions training data and test sets but does not explicitly describe distinct train/validation/test splits with percentages or counts. For PCFG, it states 'data is sampled on the fly from the data generating process during training time,' which implies continuous on-the-fly generation rather than fixed splits. |
| Hardware Specification | No | The paper mentions using the 'minGPT model by Karpathy (2020)' and 'models with a similar architecture to LLaMA 2 (Touvron et al., 2023)' but does not provide specific details on the hardware (e.g., GPU models, CPU specifications, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions frameworks and optimizers such as the 'Tracr library', 'RASP library', 'minGPT model', 'LLaMA 2', 'SGD with momentum', and 'AdamW optimizer'. However, it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | The compiled model is fine-tuned using SGD with momentum for 10K iterations with a batch size of 96. We choose to use SGD with momentum as the optimizer, using the following four choices of learning rates: Large LR (10⁻¹), Medium LR (10⁻²), Small LR (10⁻³), and Very Small LR (10⁻⁴). Linear warmup is used for 2K iterations, followed by a cosine schedule with a minimum learning rate on the order of 10² smaller than its max value. For PCFG: 'Fine-tuning is done for 10K iterations using the AdamW optimizer with a batch size of 96 samples. Similar to the pre-training phase, we use a cosine learning rate schedule with an initial warmup of 20% of the fine-tuning iterations. The minimum value of the learning rate is set to be 100× lower than the maximum learning rate.' An illustrative warmup-plus-cosine schedule sketch follows this table. |
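
The paper's Algorithms 1 and 2 are RASP programs that Tracr compiles into transformer weights. The snippet below is a minimal illustrative sketch of that workflow, not the paper's exact algorithms: the helper name `make_target_fraction`, the target token, and the toy vocabulary are assumptions, and the program computes a running fraction of a target token rather than the full Counter or Max Identifier programs.

```python
# Minimal sketch of compiling a counting-style RASP program with Tracr.
# Not the paper's Algorithm 1/2; names, vocab, and task details are illustrative.
from tracr.rasp import rasp
from tracr.compiler import compiling


def make_target_fraction(target: str = "a") -> rasp.SOp:
    """At each position, the fraction of tokens so far that equal `target`."""
    # 0/1 indicator of the target token (numerical SOp).
    is_target = rasp.numerical(
        rasp.Map(lambda t: 1 if t == target else 0, rasp.tokens))
    # Attend to all positions up to and including the current one.
    prevs = rasp.Select(rasp.indices, rasp.indices, rasp.Comparison.LEQ)
    # Averaging the indicator over the attended positions gives the running fraction.
    return rasp.numerical(rasp.Aggregate(prevs, is_target, default=0))


model = compiling.compile_rasp_to_model(
    make_target_fraction("a"),
    vocab={"a", "b", "c"},
    max_seq_len=8,
    compiler_bos="BOS",
)
print(model.apply(["BOS", "a", "b", "a"]).decoded)
```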
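For the TinyStories-Instruct experiments, loading the dataset from the Hugging Face Hub would look roughly like the sketch below. The dataset id `roneneldan/TinyStoriesInstruct`, the split name, and the `text` field are assumptions based on the Hub listing; the paper does not publish its data-loading code.

```python
# Rough sketch of loading TinyStories-Instruct with the `datasets` library.
# Split and field names are assumptions, not taken from the paper's code.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStoriesInstruct")
print(ds)                      # available splits and their sizes
print(ds["train"][0]["text"])  # one instruction-formatted story
```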
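The experiment setup describes linear warmup followed by cosine decay to a minimum learning rate roughly 100× below the peak. The sketch below shows one way to implement such a schedule; the peak learning rate, warmup length, and total iteration count are illustrative values taken from the compiled-model description, not the paper's training code.

```python
# Sketch of a linear-warmup + cosine-decay learning-rate schedule,
# matching the setup described above. Values are illustrative.
import math


def lr_at_step(step: int,
               max_lr: float = 1e-2,      # e.g. the "Medium LR" setting
               min_lr: float = 1e-4,      # ~100x below the peak, as described
               warmup_iters: int = 2_000,
               total_iters: int = 10_000) -> float:
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if step < warmup_iters:
        # Linear warmup from 0 up to max_lr.
        return max_lr * (step + 1) / warmup_iters
    # Cosine decay from max_lr down to min_lr over the remaining iterations.
    progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 1_000, 2_000, 6_000, 9_999):
        print(step, f"{lr_at_step(step):.2e}")
```

In a training loop, the returned value would be written into the optimizer's parameter groups (e.g., `group["lr"] = lr_at_step(step)` for a PyTorch optimizer) before each update.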