Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
Authors: Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on T5, BERT, and GPT models show GRACE's state-of-the-art performance in making and retaining edits, while generalizing to unseen inputs. |
| Researcher Affiliation | Collaboration | Thomas Hartvigsen (University of Virginia, MIT) hartvigsen@virginia.edu; Swami Sankaranarayanan (Sony AI) swami.sankaranarayanan@sony.com; Hamid Palangi (Microsoft Research) hpalangi@microsoft.com; Yoon Kim (MIT) yoonkim@mit.edu; Marzyeh Ghassemi (MIT) mghassem@mit.edu |
| Pseudocode | Yes | Algorithm 1: Update Codebook at layer l. (A hedged sketch of this routine appears below the table.) |
| Open Source Code | Yes | Our code is available at github.com/thartvigsen/grace. |
| Open Datasets | Yes | We introduce two new public benchmarks for lifelong model editing: mitigating LLM hallucination [27] and addressing label shifts [6]. We evaluate GRACE on three sequential editing tasks with corresponding pre-trained models, as shown in Table 1. 1) We edit a 60-million parameter T5 model [35] trained for context-free question-answering, as is used in [30]. We extract potential edits from the validation set of zsRE [24]... 2) We edit a 110-million parameter BERT classifier trained for a new editing task with label shift using the SCOTUS dataset from Fairlex [6]... 3) We introduce a new editing task by correcting a GPT language model's hallucinations... We finetune GPT2 on the already-accurate Hallucination data mixed with 200 random sentences from OpenWebText [10]... |
| Dataset Splits | Yes | There are 7.4k training documents from cases that took place from 1946-1982. Then, there are 914 validation documents from cases from 1982-1991. Finally, there are 931 testing cases from 1991-2009. |
| Hardware Specification | Yes | We trained all methods using various GPUs including 48GB NVIDIA RTX A6000s, 40GB NVIDIA A100s, and 80GB NVIDIA A100s. Timing experiments are reported from experiments using a 48GB NVIDIA RTX A6000 GPU. |
| Software Dependencies | No | The paper mentions tools and optimizers like Adam and Huggingface ('All methods are optimized using Adam [18].', 'We finetune a 110 million BERT model (bert-base-cased on Huggingface) on the SCOTUS training set using Huggingface.', 'We use Huggingface's default model trainer.'), but it does not provide specific version numbers for software libraries or environments required for reproduction. (A hedged fine-tuning sketch using recent library versions follows the table.) |
| Experiment Setup | Yes | All methods are optimized using Adam [18]. Since edits are singular and sequential in our setup, the batch size is always 1. For our Finetuning... we consider learning rates of 1.0, 1e-1, 1e-2, 1e-3, 1e-4, and 1e-5. ...For Adaptor methods GRACE and Defer, we found that a large learning rate of 1.0 was required... GRACE has one unique hyperparameter, ϵ_init... In our main results, we set ϵ_init = 0.5 for zsRE... We use 50 epochs with a batch size of 4 and a learning rate of 5e-05 with an AdamW optimizer... (These hyperparameters are echoed in the sketches below the table.) |
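
The "Pseudocode" row refers to Algorithm 1, which updates GRACE's discrete key-value codebook at a single layer. Below is a minimal, hedged PyTorch sketch of that routine, written only from the paper's high-level description: the class and argument names (`GraceAdaptor`, `eps_init`, `fit_value`) are illustrative, the conflict-handling details are simplified, and the authors' actual implementation lives at github.com/thartvigsen/grace.

```python
import torch
import torch.nn as nn


class GraceAdaptor(nn.Module):
    """Wraps one frozen layer; keys are cached layer inputs, values are trainable vectors."""

    def __init__(self, layer: nn.Module, eps_init: float = 0.5):
        super().__init__()
        self.layer = layer              # frozen base-model layer being edited
        self.eps_init = eps_init        # initial deferral radius for new entries
        self.keys, self.values, self.eps, self.labels = [], [], [], []

    def _nearest(self, h):
        # Index and distance of the closest stored key (None/inf if the codebook is empty).
        if not self.keys:
            return None, float("inf")
        dists = torch.stack([torch.dist(h, k) for k in self.keys])
        i = int(torch.argmin(dists))
        return i, float(dists[i])

    def forward(self, h):
        # Inference: if h lands inside some key's eps-ball, return that key's value
        # instead of the frozen layer's output; otherwise the model is untouched.
        i, d = self._nearest(h)
        if i is not None and d < self.eps[i]:
            return self.values[i]
        return self.layer(h)

    def update(self, h, edit_label, fit_value):
        """One Algorithm-1-style codebook update for a single streaming edit.

        h          -- layer input for the edit (a 1-D activation vector)
        edit_label -- identifier of the desired output for this edit
        fit_value  -- hypothetical callable that trains a value vector for the edit,
                      e.g. a few Adam steps on the edit loss through the rest of the
                      model (the learning rate of 1.0 reported above would live here)
        """
        i, d = self._nearest(h)
        if i is None or d > self.eps[i] + self.eps_init:
            # Far from every existing key: add a fresh entry with radius eps_init.
            self.keys.append(h.detach())
            self.values.append(fit_value(self.layer(h).detach()))
            self.eps.append(self.eps_init)
            self.labels.append(edit_label)
        elif self.labels[i] == edit_label:
            # The overlapping key already encodes the same label: expand its ball
            # so it also covers the new query.
            self.eps[i] = max(self.eps[i], d)
        else:
            # Conflicting nearby key: shrink both radii so the balls do not overlap,
            # then add a new entry for the incoming edit.
            self.eps[i] = d / 2
            self.keys.append(h.detach())
            self.values.append(fit_value(self.layer(h).detach()))
            self.eps.append(d / 2)
            self.labels.append(edit_label)
```

Under this reading, the ϵ_init = 0.5 and the learning rate of 1.0 quoted in the "Experiment Setup" row would correspond to the `eps_init` argument and to the optimizer inside `fit_value`, respectively.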
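
For the "Software Dependencies" and "Experiment Setup" rows: the paper names Huggingface's default trainer and the fine-tuning hyperparameters (50 epochs, batch size 4, learning rate 5e-05, AdamW) but no library versions. The sketch below is one plausible way to reproduce the bert-base-cased SCOTUS fine-tuning with recent `transformers`/`datasets` releases; the Hub dataset path (`coastalcph/fairlex`, `scotus` config) and the `text`/`label` column names are assumptions not stated in the paper.

```python
# Hypothetical reproduction of the bert-base-cased SCOTUS fine-tuning described above.
# Dataset path, config name, and column names are assumptions; hyperparameters
# (50 epochs, batch size 4, lr 5e-05, AdamW) come from the table rows above.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("coastalcph/fairlex", "scotus")  # Fairlex SCOTUS (assumed Hub path)
num_labels = dataset["train"].features["label"].num_classes  # assumes a ClassLabel column

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=num_labels
)

def tokenize(batch):
    # Truncate long court opinions to BERT's 512-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-scotus",
    num_train_epochs=50,             # reported in the "Experiment Setup" row
    per_device_train_batch_size=4,   # reported batch size
    learning_rate=5e-5,              # reported learning rate; Trainer defaults to AdamW
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,             # enables default dynamic padding
)
trainer.train()
```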