Transformer-Patcher: One Mistake Worth One Neuron

Authors: Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, Zhang Xiong

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on both classification and generation tasks show that Transformer-Patcher can successively correct up to thousands of errors (Reliability) and generalize to their equivalent inputs (Generality) while retaining the model's accuracy on irrelevant inputs (Locality).
Researcher Affiliation | Collaboration | Zeyu Huang (1,2), Yikang Shen (4), Xiaofeng Zhang (1,2), Jie Zhou (5), Wenge Rong (1,3), Zhang Xiong (1,3); (1) State Key Laboratory of Software Development Environment, Beihang University, China; (2) Sino-French Engineer School, Beihang University, China; (3) School of Computer Science and Engineering, Beihang University, China; (4) Mila, University of Montreal, Canada; (5) WeChat AI, Tencent Inc., China
Pseudocode | No | Appendix A describes the "Multiple Neuron Patching" principle using equations and textual explanations, but it does not provide a structured pseudocode block or algorithm.
Open Source Code | Yes | The code is available at https://github.com/ZeroYuHuang/Transformer-Patcher.
Open Datasets | Yes | For FC, we apply a BERT base model (Devlin et al., 2019) and the FEVER dataset (Thorne et al., 2018). For QA, we apply a BART base model (Lewis et al., 2020) and the Zero-Shot Relation Extraction (zsRE) dataset (Levy et al., 2017). We directly use the equivalent set released by Cao et al. (2021).
Dataset Splits | Yes | We first split the original D_train into an edit set D_edit and a new training set D'_train. ... For closed-book fact-checking, ... split the original training data into three subsets: a new training set D'_train, a new validation set D_val and an edit set D_edit in the ratio of 0.8 : 0.1 : 0.1. ... For closed-book question answering, ... employ the same data split process as FEVER in the ratio of 0.9 : 0.075 : 0.025. (A minimal split sketch follows this table.)
Hardware Specification | Yes | Using a V100, one edit costs only 7.1s for FC and 18.9s for QA. ... we run SME experiment n=20 times on n different edit folders simultaneously using 8 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions that "Adam optimizer (Kingma & Ba, 2015) is applied for both tasks" but does not provide specific version numbers for software libraries, programming languages, or frameworks (e.g., Python, PyTorch, TensorFlow), or other dependencies.
Experiment Setup | Yes | The initial learning rate is set as 0.01. The Adam optimizer (Kingma & Ba, 2015) is applied for both tasks. Every patch is initialized with the normalized corresponding query q_e / ||q_e||_2. ... The parameter k_a mentioned in equation 30 is set as 5, and the parameter k for the memory loss is set as 1000. (A hedged patch-initialization sketch follows this table.)
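
To make the split procedure quoted in the Dataset Splits row concrete, below is a minimal sketch of a three-way random split with the reported ratios (0.8 : 0.1 : 0.1 for FEVER/FC, 0.9 : 0.075 : 0.025 for zsRE/QA). The function name `three_way_split`, the shuffling strategy, and the seed are illustrative assumptions, not the authors' released preprocessing code.

```python
# Hedged sketch of the D'_train / D_val / D_edit split described above.
# Names and the shuffle-then-slice strategy are assumptions for illustration.
import random

def three_way_split(examples, train_frac, val_frac, edit_frac, seed=0):
    assert abs(train_frac + val_frac + edit_frac - 1.0) < 1e-9
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    d_train = examples[:n_train]                 # D'_train
    d_val = examples[n_train:n_train + n_val]    # D_val
    d_edit = examples[n_train + n_val:]          # D_edit (remainder)
    return d_train, d_val, d_edit

# FC (FEVER):  fc_train, fc_val, fc_edit = three_way_split(fever_train, 0.8, 0.1, 0.1)
# QA (zsRE):   qa_train, qa_val, qa_edit = three_way_split(zsre_train, 0.9, 0.075, 0.025)
```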
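The Experiment Setup row states that each patch is initialized with the normalized query q_e / ||q_e||_2 and trained with Adam at a learning rate of 0.01. The PyTorch sketch below shows one way such a single-neuron patch could be attached to a frozen FFN; the `PatchedFFN` wrapper, the stand-in `nn.Linear` base FFN, the ReLU activation, and the tensor shapes are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Minimal sketch of "one mistake, one neuron" patch initialization, assuming a
# simple FFN wrapper: the extra key is set to q_e / ||q_e||_2 and only the patch
# parameters are optimized with Adam at the reported learning rate of 0.01.
import torch
import torch.nn as nn

class PatchedFFN(nn.Module):
    def __init__(self, ffn: nn.Module, d_model: int, q_e: torch.Tensor):
        super().__init__()
        self.ffn = ffn                                     # frozen original FFN (placeholder here)
        self.patch_key = nn.Parameter(q_e / q_e.norm(p=2)) # key initialized as q_e / ||q_e||_2
        self.patch_bias = nn.Parameter(torch.zeros(1))
        self.patch_value = nn.Parameter(torch.zeros(d_model))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # activation of the single extra neuron, added on top of the unchanged FFN output
        a = torch.relu(h @ self.patch_key + self.patch_bias)
        return self.ffn(h) + a.unsqueeze(-1) * self.patch_value

# d_model, q_e, and the nn.Linear base FFN are placeholders for the edited model.
d_model = 768
q_e = torch.randn(d_model)
patched = PatchedFFN(nn.Linear(d_model, d_model), d_model, q_e)
out = patched(torch.randn(2, 4, d_model))  # (batch, seq, d_model)

optimizer = torch.optim.Adam(
    [patched.patch_key, patched.patch_bias, patched.patch_value], lr=0.01
)
```

Freezing the base FFN and optimizing only the three patch parameters mirrors the paper's claim that one edit is cheap (seconds on a V100), since the gradient touches a single added neuron rather than the full model.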