Transformer-Patcher: One Mistake Worth One Neuron
Authors: Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, Zhang Xiong
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both classification and generation tasks show that Transformer-Patcher can successively correct up to thousands of errors (Reliability) and generalize to their equivalent inputs (Generality) while retaining the model's accuracy on irrelevant inputs (Locality). |
| Researcher Affiliation | Collaboration | Zeyu Huang (1,2), Yikang Shen (4), Xiaofeng Zhang (1,2), Jie Zhou (5), Wenge Rong (1,3), Zhang Xiong (1,3). (1) State Key Laboratory of Software Development Environment, Beihang University, China; (2) Sino-French Engineer School, Beihang University, China; (3) School of Computer Science and Engineering, Beihang University, China; (4) Mila, University of Montreal, Canada; (5) WeChat AI, Tencent Inc., China |
| Pseudocode | No | Appendix A describes the "Multiple Neuron Patching" principle using equations and textual explanations, but it does not provide a structured pseudocode block or algorithm (a hedged sketch of the idea follows this table). |
| Open Source Code | Yes | The code is available at https://github.com/ZeroYuHuang/Transformer-Patcher. |
| Open Datasets | Yes | For FC, we apply a BERT base model (Devlin et al., 2019) and the FEVER dataset (Thorne et al., 2018). For QA, we apply a BART base model (Lewis et al., 2020) and the Zero-Shot Relation Extraction (zsRE) dataset (Levy et al., 2017). We directly use the equivalent set released by Cao et al. (2021). |
| Dataset Splits | Yes | We first split the original D_train into an edit set D_edit and a new training set D'_train. ... For closed-book fact-checking, ... split the original training data into three subsets: a new training set D'_train, a new validation set D_val and an edit set D_edit in the ratio of 0.8 : 0.1 : 0.1. ... For closed-book question answering, ... employ the same data split process as FEVER in the ratio of 0.9 : 0.075 : 0.025 (see the split sketch below this table). |
| Hardware Specification | Yes | Using a V100, one edit costs only 7.1s for FC and 18.9s for QA. ... we run SME experiment n=20 times on n different edit folders simultaneously using 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions "Adam optimizer (Kingma & Ba, 2015) is applied for both tasks." but does not provide specific version numbers for software libraries, programming languages (e.g., Python, PyTorch, TensorFlow), or other dependencies. |
| Experiment Setup | Yes | The initial learning rate is set as 0.01. Adam optimizer (Kingma & Ba, 2015) is applied for both tasks. Every patch is initialized with the normalized corresponding query q_e / \|\|q_e\|\|_2. ... The parameter k_a mentioned in equation 30 is set as 5, and the parameter k for memory loss is set as 1000 (an optimizer-setup sketch follows this table). |
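
The Pseudocode row notes that the paper describes "Multiple Neuron Patching" only through equations and prose. Below is a minimal sketch, not the authors' released code, of the underlying idea of adding one trainable key/value neuron to a feed-forward layer per corrected mistake, assuming a standard FFN of the form `act(h·W_k + b_k)·W_v + b_v`. The class and attribute names (`PatchedFFN`, `add_patch`, the zero initialization of the patch bias and value) are assumptions for illustration.

```python
import torch
import torch.nn as nn


class PatchedFFN(nn.Module):
    """FFN block extended with one extra key/value neuron per correction."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.key = nn.Linear(d_model, d_ffn)     # frozen W_k, b_k of the base model
        self.value = nn.Linear(d_ffn, d_model)   # frozen W_v, b_v of the base model
        self.act = nn.GELU()
        # One entry per corrected mistake ("one mistake, one neuron").
        self.patch_keys = nn.ParameterList()     # each shaped (d_model,)
        self.patch_biases = nn.ParameterList()   # each shaped (1,)
        self.patch_values = nn.ParameterList()   # each shaped (d_model,)

    def add_patch(self, query: torch.Tensor) -> None:
        """Grow one neuron for one mistake; the key starts at q_e / ||q_e||_2.

        Zero initialization of the bias and value is an assumption here.
        """
        q = query.detach()
        self.patch_keys.append(nn.Parameter(q / q.norm(p=2)))
        self.patch_biases.append(nn.Parameter(torch.zeros(1)))
        self.patch_values.append(nn.Parameter(torch.zeros(q.numel())))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        out = self.value(self.act(self.key(h)))
        # Each patch adds act(h · k_p + b_p) * v_p on top of the frozen FFN output.
        for k_p, b_p, v_p in zip(self.patch_keys, self.patch_biases, self.patch_values):
            a_p = self.act(h @ k_p + b_p)        # scalar activation per position
            out = out + a_p.unsqueeze(-1) * v_p
        return out
```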
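The split quoted in the Dataset Splits row can be illustrated with a short helper. `split_for_editing`, its arguments, and the fixed seed are placeholders, not part of the released pipeline; only the ratios (0.8 : 0.1 : 0.1 for FEVER, 0.9 : 0.075 : 0.025 for zsRE) come from the quote.

```python
import random


def split_for_editing(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split the original training data into D'_train, D_val and D_edit."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    d_train_new = shuffled[:n_train]
    d_val = shuffled[n_train:n_train + n_val]
    d_edit = shuffled[n_train + n_val:]
    return d_train_new, d_val, d_edit


# FEVER-style split; for zsRE the quoted ratio is 0.9 : 0.075 : 0.025.
# d_train_new, d_val, d_edit = split_for_editing(fever_train, ratios=(0.8, 0.1, 0.1))
```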
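Finally, the Experiment Setup row quotes the patch initialization q_e / ||q_e||_2 and Adam with an initial learning rate of 0.01. The sketch below shows how an edit step could be wired up under the assumption that only the patch parameters are trained while the base model stays frozen; `make_patch_optimizer` is a hypothetical helper built on the `PatchedFFN` sketch above, and the k_a = 5 / k = 1000 loss weights from the quote are not reproduced.

```python
import torch


def make_patch_optimizer(model: torch.nn.Module, ffn: "PatchedFFN") -> torch.optim.Adam:
    """Freeze the edited model and return an Adam optimizer over the patch only."""
    for p in model.parameters():
        p.requires_grad_(False)
    patch_params = (
        list(ffn.patch_keys) + list(ffn.patch_biases) + list(ffn.patch_values)
    )
    for p in patch_params:
        p.requires_grad_(True)
    # Initial learning rate 0.01, as quoted above; the edit/memory loss terms are omitted.
    return torch.optim.Adam(patch_params, lr=0.01)
```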