Massive Editing for Large Language Models via Meta Learning
Authors: Chenmien Tan, Ge Zhang, Jie Fu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed book fact-checking and question answering. |
| Researcher Affiliation | Collaboration | Chenmien Tan (University of Edinburgh), Ge Zhang (University of Waterloo, 01.AI), Jie Fu (HKUST) |
| Pseudocode | Yes | Algorithm 1: Editor Inference |
| Open Source Code | Yes | Our code is available at https://github.com/ChenmienTan/malmen. |
| Open Datasets | Yes | For BERT-base, we use the Fact Extraction and VERtification (FEVER) dataset (Thorne et al., 2018) with the identical train/val splits with De Cao et al. (2021); Mitchell et al. (2022), which contains 104,996 training and 10,444 validation samples. |
| Dataset Splits | Yes | For BERT-base, we use the Fact Extraction and VERtification (FEVER) dataset (Thorne et al., 2018) with the identical train/val splits with De Cao et al. (2021); Mitchell et al. (2022), which contains 104,996 training and 10,444 validation samples. |
| Hardware Specification | Yes | As for computation time, it takes 12.25 and 33.75 hours in total (including training) for MALMEN and MEMIT to edit 16,384 facts on GPT-J using a single NVIDIA A100 GPU, respectively. |
| Software Dependencies | No | The paper mentions optimizers like Adam and AdamW and uses various language models, but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation. |
| Experiment Setup | Yes | We use identical hyper-parameters for MEND and MALMEN as follows: rank of linear transformation in hyper-network 1920; number of blocks in hyper-network 2; initial learning rate 1e-6; meta-learning rate 1e-5; locality coefficient 1; maximum meta gradient norm 1 |
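
For readability, the shared MEND/MALMEN hyper-parameters from the Experiment Setup row can be collected into a small configuration object. This is a minimal sketch for this summary; the key names are illustrative assumptions and do not come from the released MALMEN repository.

```python
# Sketch of the shared MEND/MALMEN hyper-parameters reported above.
# Key names are chosen for this summary, not identifiers from the MALMEN code base.
hparams = {
    "hypernet_rank": 1920,        # rank of linear transformation in hyper-network
    "hypernet_blocks": 2,         # number of blocks in hyper-network
    "initial_lr": 1e-6,           # initial learning rate
    "meta_lr": 1e-5,              # meta-learning rate
    "locality_coef": 1.0,         # locality coefficient
    "max_meta_grad_norm": 1.0,    # maximum meta gradient norm
}

if __name__ == "__main__":
    # Print the configuration so it can be checked against the paper's table.
    for name, value in hparams.items():
        print(f"{name}: {value}")
```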