Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Authors: Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings. (An illustrative probing sketch is given below the table.)
Researcher Affiliation | Collaboration | (1) University of Cambridge, UK; (2) University College London, UK; (3) EECS Department, University of Michigan, Ann Arbor, MI, USA; (4) Center for Brain Science, Harvard University, Cambridge, MA, USA; (5) Physics & Informatics Laboratories, NTT Research, Inc., Sunnyvale, CA, USA
Pseudocode | Yes | Algorithm 1 (Pseudocode for compiling the Counter capability via Tracr): RASP code used to generate the model for the Counter capability and task via Tracr. Algorithm 2 (Pseudocode for compiling the Max Identifier capability via Tracr): RASP code used to generate the model for the Max Identifier capability and task via Tracr. (A generic Tracr compilation sketch is given below the table.)
Open Source Code | No | The paper states 'Our code is based on this repository: https://github.com/karpathy/llama2.c' in Appendix F.1 regarding TinyStories, which is an existing open-source project. It does not provide an explicit statement or link for source code specific to the authors' modifications or to the general methodology described in the paper.
Open Datasets | Yes | Specifically, we focus on the following two setups: (i) compiled transformer models based on the Tracr library (Lindner et al., 2023; Weiss et al., 2021), which allows encoding specific computational programs into a transformer, and (ii) procedurally generated setups involving probabilistic context-free grammars (PCFGs) (Sipser, 1996; Chomsky, 1956)... Additionally, we perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup. For the TinyStories results, we use the TinyStories-Instruct variant of the dataset (Eldan & Li, 2023) (see App. B.3 for an example). These models are able to follow specific instructions to write coherent English stories over multiple paragraphs. This data consists of children's stories written by GPT-3.5 and GPT-4. We use the TinyStories-Instruct version of this dataset: https://huggingface.co/datasets/roneneldan/TinyStoriesInstruct (a loading sketch is given below the table).
Dataset Splits | No | The paper mentions training data and test sets but does not explicitly describe distinct train/validation/test splits with percentages or counts. For PCFG, it states 'data is sampled on the fly from the data generating process during training time,' which implies continuous generation rather than fixed splits. (An on-the-fly PCFG sampling sketch is given below the table.)
Hardware Specification | No | The paper mentions using the 'minGPT model by Karpathy (2020)' and 'models with a similar architecture to LLaMa 2 (Touvron et al., 2023)' but does not provide specific details on the hardware (e.g., GPU models, CPU specifications, memory) used for running the experiments.
Software Dependencies | No | The paper mentions frameworks and optimizers such as the 'Tracr library', 'RASP', 'minGPT model', 'LLaMa 2', 'SGD with momentum', and the 'AdamW optimizer'. However, it does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | For Tracr: 'The compiled model is fine-tuned using SGD with momentum for 10K iterations with a batch size of 96. We choose to use SGD with momentum as the optimizer, using the following four choices of learning rates: Large LR (10^-1), Medium LR (10^-2), Small LR (10^-3), and Very Small LR (10^-4). Linear warmup is used for 2K iterations followed by a cosine schedule with a minimum learning rate on the order of 10^2 smaller than its max value.' For PCFG: 'Fine-tuning is done for 10K iterations using the AdamW optimizer with a batch size of 96 samples. Similar to the pre-training phase, we use a cosine learning rate with an initial warmup of 20% of the fine-tuning iterations. The minimum value of the learning rate is set to be 100x lower than the maximum learning rate.' (A warmup-plus-cosine schedule sketch is given below the table.)
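
To make the "network pruning and probing" tooling mentioned in the Research Type row concrete, here is a minimal linear-probing sketch in PyTorch. It is not the authors' code; the activation tensor, label set, and probe dimensions are illustrative placeholders, and the idea is simply that a high-accuracy linear probe on cached activations is evidence that a layer still encodes a capability-relevant feature.

```python
# Minimal linear-probe sketch (illustrative; not the paper's implementation).
# Assumes activations from some intermediate layer have been cached as `acts`
# with shape (num_examples, d_model) and integer capability labels `labels`.
import torch
import torch.nn as nn

d_model, num_classes = 128, 2                      # assumed sizes
acts = torch.randn(1024, d_model)                  # placeholder cached activations
labels = torch.randint(0, num_classes, (1024,))    # placeholder labels

probe = nn.Linear(d_model, num_classes)            # the probe is a single affine map
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                               # short probe training loop
    opt.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    opt.step()

# High probe accuracy suggests the layer linearly encodes the probed feature.
acc = (probe(acts).argmax(-1) == labels).float().mean()
print(f"probe accuracy: {acc:.3f}")
```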
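
The Pseudocode row refers to Algorithms 1 and 2, which compile RASP programs into transformer weights with Tracr. Below is a minimal counting-style RASP program compiled through Tracr's public API. It is a generic illustration (counting, at each position, how many tokens in the sequence match the current token), not the paper's exact Counter or Max Identifier program; the vocabulary and maximum sequence length are assumed values.

```python
# Illustrative Tracr compilation of a simple counting-style RASP program.
# This is NOT the paper's Algorithm 1/2; vocab and max_seq_len are assumptions.
from tracr.rasp import rasp
from tracr.compiler import compiling

# For each query position, select every key position holding the same token,
# then read off the selector width = number of matching tokens in the sequence.
same_token = rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.EQ)
count_matches = rasp.SelectorWidth(same_token)

model = compiling.compile_rasp_to_model(
    count_matches,
    vocab={"a", "b", "c"},   # assumed vocabulary
    max_seq_len=10,          # assumed maximum sequence length
    compiler_bos="BOS",
)

out = model.apply(["BOS", "a", "b", "a", "a"])
print(out.decoded)  # per-position counts of tokens equal to the token at that position
```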
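
The Open Datasets row points to the TinyStories-Instruct data on the Hugging Face Hub. A typical way to load it is via the `datasets` library, as sketched below; the available split names for this particular dataset are an assumption.

```python
# Loading the TinyStories-Instruct data from the Hugging Face Hub (sketch).
# The dataset identifier follows the URL cited above; the split name is assumed.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStoriesInstruct", split="train")
print(ds[0])  # one instruction-conditioned children's story record
```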
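
The Dataset Splits row notes that PCFG data is "sampled on the fly from the data generating process during training time". The sketch below shows what such on-the-fly sampling from a probabilistic context-free grammar can look like; the toy grammar and terminal sets are invented for illustration and are not the grammars used in the paper.

```python
# Toy on-the-fly sampler for a probabilistic context-free grammar (PCFG).
# The grammar below is illustrative only; the paper defines its own PCFGs.
import random

# Each nonterminal maps to a list of (probability, expansion) pairs.
GRAMMAR = {
    "S":  [(0.5, ["NP", "VP"]), (0.5, ["NP", "VP", "NP"])],
    "NP": [(0.6, ["det", "noun"]), (0.4, ["noun"])],
    "VP": [(1.0, ["verb"])],
}
TERMINALS = {"det": ["the", "a"], "noun": ["cat", "dog"], "verb": ["sees", "chases"]}

def sample(symbol="S"):
    """Recursively expand `symbol` into a list of terminal tokens."""
    if symbol in TERMINALS:
        return [random.choice(TERMINALS[symbol])]
    probs, expansions = zip(*GRAMMAR[symbol])
    expansion = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for child in expansion for tok in sample(child)]

# Fresh sequences are drawn at every training step, so there is no fixed split.
for _ in range(3):
    print(" ".join(sample()))
```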
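
The Experiment Setup row describes linear warmup followed by a cosine schedule that decays to a minimum roughly 100x below the peak learning rate. A minimal schedule function in that style is sketched below (the same pattern used in karpathy/llama2.c, on which the authors say their code is based); the constants mirror the Tracr fine-tuning description and are otherwise assumptions.

```python
# Linear-warmup + cosine-decay learning-rate schedule (sketch).
# Constants follow the Tracr fine-tuning description: 10K iterations, 2K warmup,
# and a minimum LR ~100x smaller than the peak. Exact values are assumptions.
import math

MAX_ITERS, WARMUP_ITERS = 10_000, 2_000
MAX_LR = 1e-2                 # the paper's "Medium LR" value, used here as the peak
MIN_LR = MAX_LR / 100         # ~10^2 smaller than the peak

def get_lr(it: int) -> float:
    if it < WARMUP_ITERS:                        # linear warmup to MAX_LR
        return MAX_LR * it / WARMUP_ITERS
    # cosine decay from MAX_LR down to MIN_LR over the remaining iterations
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

# Example: the LR ramps up, peaks at iteration 2000, then decays toward MIN_LR.
for it in (0, 1_000, 2_000, 6_000, 10_000):
    print(it, round(get_lr(it), 6))
```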