Interpreting Learned Feedback Patterns in Large Language Models
Authors: Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test this, we train probes to estimate the feedback signal implicit in the activations of a fine-tuned LLM. Our objective is to study and measure this divergence. |
| Researcher Affiliation | Collaboration | Apart Research; Cynch.ai; University of Cambridge; Department of Engineering Sciences, University of Oxford |
| Pseudocode | No | The paper includes mathematical equations and descriptions of processes but does not feature any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code for all of our experiments is available at https://github.com/apartresearch/Interpreting-Learned-Feedback-Patterns. |
| Open Datasets | Yes | Our first fine-tuning task is controlled sentiment generation, in which models generate completions to prefixes from the IMDB dataset [22]. M_RLHF is fine-tuned using DPO with responses from the Anthropic HH-RLHF dataset [4]. The second task uses DPO to optimize M_RLHF for toxicity using the toxic-dpo dataset [36]. (A loading sketch for these datasets follows the table.) |
| Dataset Splits | No | The paper mentions 'validation' in the context of validating probes and 'test split' for datasets, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) needed to reproduce the experiment. |
| Hardware Specification | Yes | For each model and task, we train for approximately 6 hours on a single A10 GPU, except for Gemma-2b-it, where we used an A40 GPU. It takes approximately four hours to train an autoencoder for all of the five high-divergence layers on a single A100. |
| Software Dependencies | No | The paper mentions software like the 'spaCy [17] tokenizer', the 'Transformer Reinforcement Learning (TRL) framework [39]', 'PPO', and the 'AdamW optimizer', but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | The hyperparameters used for all models are: a batch size of 64, mini-batch size of 16, KL coefficient of 0.5, max grad norm of 1, and learning rate of 1e-6, with the remaining parameters set to the library defaults. We train for 5000 steps using the AdamW optimizer with an Adam-Epsilon of 1e-8, a batch size of 8 for Pythia-70m, Pythia-160m and GPT-Neo-125m, and an effective batch size of 16 for Gemma-2b-it. The learning rate was 3e-5 for Pythia-70m, Pythia-160m and GPT-Neo-125m, and 5e-5 for Gemma-2b-it. We train all autoencoders for 75000 training examples with an ℓ1 coefficient of 0.001, a learning rate of 1e-3, and a batch size of 32. (Hedged config sketches follow the table.) |
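
All three datasets quoted in the "Open Datasets" row are distributed through the Hugging Face Hub, so they can be loaded with the `datasets` library. A minimal sketch: the IMDB and HH-RLHF repo ids are the standard ones, while the toxic-dpo repo id is an assumption, since the paper cites that dataset only as reference [36].

```python
from datasets import load_dataset

imdb = load_dataset("imdb")              # prefixes for controlled sentiment generation
hh = load_dataset("Anthropic/hh-rlhf")   # preference pairs for DPO fine-tuning

# Repo id below is an assumption; the paper cites "toxic-dpo" only as [36].
toxic = load_dataset("unalignment/toxic-dpo-v0.1")
```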
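The PPO hyperparameters in the "Experiment Setup" row map directly onto the TRL framework the paper reports using. A minimal sketch, assuming TRL's classic `PPOConfig`/`PPOTrainer` API; the field mapping is our reading of the quoted values, not the authors' actual training script (see their repository for that).

```python
# Hedged sketch: the quoted PPO hyperparameters expressed as a TRL PPOConfig.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(
    model_name="EleutherAI/pythia-160m",  # one of the models studied
    batch_size=64,                        # "batch size of 64"
    mini_batch_size=16,                   # "mini-batch size of 16"
    init_kl_coef=0.5,                     # "KL coefficient of 0.5"
    max_grad_norm=1.0,                    # "max grad norm of 1"
    learning_rate=1e-6,                   # "learning rate of 1e-6"
    # remaining parameters left at the library defaults, as in the paper
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# With ref_model omitted, this TRL API builds a frozen reference copy
# internally for the KL penalty against the pre-fine-tuning policy.
ppo_trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)
```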
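The final quoted sentence describes the sparse-autoencoder training. Below is a minimal PyTorch sketch of that objective, reconstruction MSE plus an ℓ1 penalty on the latent code: the architecture, dictionary width, and synthetic activation batches are assumptions, while the ℓ1 coefficient, learning rate, and batch/example counts are the quoted values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer autoencoder with a ReLU latent code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        latent = torch.relu(self.encoder(x))
        return self.decoder(latent), latent

d_model, d_hidden = 512, 2048  # assumed sizes; not quoted in the row above
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)  # "learning rate of 1e-3"
l1_coef = 0.001                                    # "l1 coefficient of 0.001"

# Stand-in for batches of residual-stream activations from the five
# high-divergence layers; 75000 examples at the quoted batch size of 32.
activation_batches = (torch.randn(32, d_model) for _ in range(75_000 // 32))

for acts in activation_batches:
    recon, latent = sae(acts)
    loss = nn.functional.mse_loss(recon, acts) + l1_coef * latent.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```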