Explorations of Self-Repair in Language Models
Authors: Cody Rushing, Neel Nanda
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We strengthen this prior work by investigating self-repair across the whole pretraining distribution, by focusing on individual attention heads, and by investigating the mechanisms behind self-repair on the whole distribution. Our key findings are: 1. Direct effect self-repair is an imperfect, noisy process. It occurs across the full pretraining distribution, even while ablating individual heads (rather than full layers). |
| Researcher Affiliation | Academia | Cody Rushing (University of Texas at Austin), Neel Nanda. Correspondence to: Cody Rushing <thisiscodyr@gmail.com>. |
| Pseudocode | No | The paper describes experimental procedures and definitions but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the code for the experiments used in this paper is provided at https://github.com/starship006/backup_research. |
| Open Datasets | Yes | We measure the self-repair of individual attention heads in Pythia-1B (Biderman et al., 2023) on 1 million tokens of The Pile (Gao et al., 2020), the dataset used to train Pythia-1B. |
| Dataset Splits | No | The paper mentions using 1 million tokens from The Pile for experiments and filters for the top 2% of tokens by direct effect, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions using 'Center for AI Safety Compute Cluster' but without specifications. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., PyTorch 1.9, Python 3.8). It only links to a GitHub repository for the code; the repository may pin such dependencies, but versions are not stated in the paper text. |
| Experiment Setup | No | The paper describes its methodology for measuring direct effect and self-repair, including resample ablations and filtering for high-direct-effect tokens. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or configurations for the models studied (Pythia, GPT-2, Llama). Such details would typically appear in an 'Experiment Setup' or 'Implementation Details' section, which is absent. |
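To make the measured quantities concrete: the paper studies how much of an ablated head's direct effect the rest of the network compensates for. The sketch below is an illustrative formalization only, not the paper's code; the function name, scalar interface, and sign convention are assumptions made for clarity.

```python
def self_repair(clean_logit: float, ablated_logit: float,
                direct_effect: float) -> float:
    """Estimate self-repair for one head at one token position.

    If the head's direct effect were removed with no downstream
    compensation, the correct-token logit would drop by exactly
    `direct_effect`. The observed drop after ablation is usually
    smaller; the gap is attributed to self-repair.
    """
    observed_drop = clean_logit - ablated_logit
    return direct_effect - observed_drop

# A head with direct effect 2.0 whose ablation only lowers the
# correct-token logit by 1.2 exhibits 0.8 logits of self-repair.
print(self_repair(clean_logit=5.0, ablated_logit=3.8, direct_effect=2.0))
```

In the paper's actual experiments, the ablated logits come from resample ablation (replacing the head's output with its output on other tokens) rather than from a hypothetical scalar, and the computation is batched over 1 million tokens of The Pile.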